<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Language identification with limited resources</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emilio Sanchis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mayte Gimenez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departament de Sistemes Informàtics i Computació, Universitat Politècnica de València</institution>
          ,
          <addr-line>Valencia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>7</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>Language identification is an important issue in many speech applications. We address this problem as the classification of sequences of phonemes, under the assumption that each language has its own phonotactic characteristics. To perform this classification, we first decode the speech utterances into phonemes. The set of phonemes must be the same for all languages, because the goal is to obtain a comparable representation of the acoustic sequences. We followed two different approaches using the same acoustic model: decoding the audio with a trigram model of phoneme sequences as language model, and decoding it with equiprobable unigrams of phonemes. A classification process based on perplexity is then performed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Our language identification approach</title>
      <p>Our proposal for language identification (LI) is based on modeling the sequences of phonetic units that characterize each language we want to
identify. The identification of the language of a spoken utterance is divided into two phases:
<list list-type="order">
<list-item><p><italic>Acoustic-Phonetic Decoding.</italic> The first phase of the LI process is a phonetic transcription of the spoken
utterance whose language must be identified. In our proposal, this phase is the same for all languages and
should therefore be language independent.</p></list-item>
<list-item><p><italic>Phonetic sequence classification.</italic> Once the spoken utterance has been phonetically transcribed, the resulting sequence must
be classified in order to determine the language of the utterance. A language model of sequences of phonetic
units is learned for each language. The selection criterion is based on minimizing the perplexity.</p></list-item>
</list>
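Let <italic>L</italic> be the set of languages, <italic>l<sub>i</sub></italic> ∈ <italic>L</italic> one of these languages, and <italic>s</italic> the phonetic unit sequence to classify. The
selected language <inline-formula><tex-math>\hat{l}</tex-math></inline-formula> is the one that minimizes the expression:
<disp-formula>
<label>(1)</label>
<tex-math>\hat{l} = \arg\min_{l_i \in L} 10^{-\frac{1}{|s|} \log p(s \mid l_i)}</tex-math>
</disp-formula>
where <italic>p</italic>(<italic>s</italic>|<italic>l<sub>i</sub></italic>) is the probability assigned to the sequence <italic>s</italic> by the model representing language <italic>l<sub>i</sub></italic>.</p>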
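      <p>The decision rule in Eq. (1) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the system described in this paper: the class <monospace>TrigramModel</monospace>, its add-one smoothing, the <monospace>BOS</monospace> padding marker, and the function <monospace>identify_language</monospace> are hypothetical names and choices introduced here, and the logarithm is assumed to be base 10 so that the minimized quantity is the usual perplexity.</p>
      <preformat>
```python
import math
from collections import Counter

BOS = "BOS"  # hypothetical padding marker for the start of a sequence


class TrigramModel:
    """Toy trigram model over phonetic units with add-one smoothing.

    Assumption: the smoothing is a stand-in chosen for brevity; it is not
    the estimator used by a toolkit such as SRILM.
    """

    def __init__(self, sequences):
        self.trigrams = Counter()
        self.contexts = Counter()
        vocab = set()
        for seq in sequences:
            padded = [BOS, BOS] + list(seq)
            vocab.update(seq)
            for i in range(2, len(padded)):
                context = (padded[i - 2], padded[i - 1])
                self.trigrams[context + (padded[i],)] += 1
                self.contexts[context] += 1
        self.vocab_size = len(vocab)

    def log10_prob(self, seq):
        """Return log10 p(s | l_i) for a sequence of phonetic units."""
        padded = [BOS, BOS] + list(seq)
        total = 0.0
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            count = self.trigrams[context + (padded[i],)]
            denom = self.contexts[context] + self.vocab_size
            total += math.log10((count + 1) / denom)
        return total

    def perplexity(self, seq):
        """Perplexity as in Eq. (1): 10 ** (-(1/|s|) * log10 p(s | l_i))."""
        if not seq:
            raise ValueError("cannot score an empty sequence")
        return 10 ** (-self.log10_prob(seq) / len(seq))


def identify_language(models, seq):
    """Return the language whose model assigns the lowest perplexity."""
    return min(models, key=lambda lang: models[lang].perplexity(seq))
```
</preformat>
      <p>Given a dictionary <monospace>models</monospace> holding one trained model per language, <monospace>identify_language(models, s)</monospace> returns the key with the lowest perplexity. Normalizing by the sequence length keeps utterances of different lengths comparable.</p>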
    </sec>
    <sec id="sec-3">
      <title>Resources and Experimentation</title>
      <p>This section describes the resources used, how we learned the language models, and the preliminary
experiments carried out in this work.</p>
      <sec id="sec-3-1">
        <title>Description of the corpus</title>
        <p>We used a corpus of 3446 spoken sentences to learn the language models and to evaluate our proposal. The
sentences were uttered by several native English, French, and Spanish speakers. The distribution of languages
in the corpus was slightly unbalanced (1338 sentences in English, 708 in French, and 1400 in Spanish). The
English and French sentences were queries to an information service about timetables and prices of long-distance
trains. The Spanish sentences were extracted from an unrestricted, phonetically balanced corpus.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Learning the models</title>
        <p>As phonetic units, we chose context-dependent phonemes. Specifically, we used triphones, i.e., phonemes
annotated with the phonemes that appear to their left and right. We learned the acoustic models
for the triphones and the models of sequences of triphones from an independent Spanish corpus. Only Spanish
triphones were considered in this work, and the same set of Spanish triphones was used in all the
experiments.</p>
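        <p>As an illustration of what a context-dependent unit looks like, the sketch below expands a phoneme sequence into triphone labels written in the <monospace>left-centre+right</monospace> notation used by HTK. The function name <monospace>to_triphones</monospace> and the use of <monospace>sil</monospace> as the context at utterance boundaries are assumptions made for this example; the paper does not state how boundaries were handled.</p>
        <preformat>
```python
def to_triphones(phonemes, boundary="sil"):
    """Expand a phoneme sequence into left-centre+right triphone labels.

    Assumption: `boundary` is a hypothetical context symbol used at the
    start and end of the utterance.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]
```
</preformat>
        <p>Each phoneme yields exactly one triphone, so the transcription keeps the length of the original phoneme sequence while every unit also encodes its neighbours.</p>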
        <p>We phonetically transcribed all the sentences in the corpus using two different Acoustic-Phonetic Decoding (APD)
modules. Both modules used the same set of triphones and the same associated acoustic models;
the difference was the model of sequences of triphones used as language model. The first APD module used a
trigram model of sequences of triphones. To avoid the bias of applying to all languages a trigram model of sequences
of phonetic units (triphones) learned from a Spanish corpus, the second module used an equiprobable
unigram model of triphones. This way, all sequences of phonetic units of the same length have the same a priori probability.
As a result, we obtained six sets of phonetically transcribed utterances: two for each language considered, one per
APD module.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Experimentation</title>
        <p>To evaluate our approach, we split the available corpus by language and used 80% of each part for
training the classification models, leaving the remaining 20% to evaluate the performance of the system. Since
we have two different APD modules (trigrams and equiprobable unigrams), we were able to learn two
sets of language models. For each set, we learned a trigram language model for every language we are trying to
discriminate.</p>
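        <p>A per-language 80/20 partition of this kind could be sketched as below. The function <monospace>split_by_language</monospace>, the random shuffling, and the fixed seed are assumptions of this example; the paper does not specify how the sentences were assigned to each part.</p>
        <preformat>
```python
import random


def split_by_language(corpus, train_fraction=0.8, seed=0):
    """Split each language's utterances into train and test parts.

    `corpus` maps a language label to a list of transcribed utterances.
    Assumption: utterances are shuffled before splitting.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for lang, utterances in corpus.items():
        shuffled = list(utterances)
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train[lang] = shuffled[:cut]
        test[lang] = shuffled[cut:]
    return train, test
```
</preformat>
        <p>Splitting inside each language keeps the language proportions of the corpus the same in the training and test parts.</p>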
        <p>We used the SRILM Toolkit [Sto02] to estimate the phonetic language models of the classifiers and the HTK Speech
Recognition Toolkit [You06] to perform the phonetic transcriptions.</p>
        <p>Two different experiments were conducted. The first experiment consisted of measuring the perplexity of
the test sets. Table 1 shows the perplexity for all training and test combinations. Each column corresponds
to the test set of a different language transcribed with a specific APD module (Trigrams APD for the APD based
on trigrams of phonetic units and Equiprobable APD for the APD based on equiprobable unigrams of phonetic
units). Each row corresponds to a classifier learned from the transcriptions of the training sentences
of a specific language obtained with a specific APD module.</p>
        <p>As expected, Table 1 shows a lower perplexity for the combinations in which the language of the classifier and the
language of the test set are the same. Regarding the APD module, lower perplexities occur when an APD based on
trigrams is used to transcribe the sentences, especially those in the test set. It seems that using trigrams of
phonetic units learned from a Spanish-only corpus is not as critical as we expected a priori.</p>
        <p>A second experiment was conducted in order to evaluate the performance of the language identification
system. The global accuracy of the system was 0.841 with the Trigram APD module and 0.775 with
the Equiprobable APD module. As in the case of perplexity, the best accuracy is obtained with the
Trigram APD module. Table 2 shows the accuracy for each of the languages involved. The best results
are obtained for Spanish, possibly because the triphones used were just those of Spanish. Although the phonetic
similarity between Spanish and French seems greater than that between Spanish and English,
the results for English are better than those obtained for French. This may be due to the larger number of English
sentences available for the experiments.</p>
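        <p>The global and per-language accuracies discussed above can be computed from paired lists of reference and predicted language labels. The following is a minimal sketch; the function name <monospace>accuracy_report</monospace> is hypothetical, and global accuracy is assumed here to mean the fraction of all test utterances that were classified correctly.</p>
        <preformat>
```python
from collections import Counter


def accuracy_report(references, hypotheses):
    """Return (global accuracy, per-language accuracy) for paired labels."""
    if len(references) != len(hypotheses) or not references:
        raise ValueError("label lists must be non-empty and of equal length")
    totals = Counter(references)
    correct = Counter(
        ref for ref, hyp in zip(references, hypotheses) if ref == hyp
    )
    overall = sum(correct.values()) / len(references)
    per_language = {lang: correct[lang] / totals[lang] for lang in totals}
    return overall, per_language
```
</preformat>
        <p>Because the global figure weights every utterance equally, languages with more test sentences contribute more to it than languages with fewer.</p>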
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and future work</title>
      <p>In this paper we have presented a preliminary approach to the language identification problem. Our proposal
is based on the classification of sequences of phonemes, assuming that each language has its own phonotactic
characteristics. The experiments show that our approach predicts the language
of the speaker reasonably well, especially considering the limited resources used. We have several ideas for improving the
performance of our system, including using truly language-independent phonetic units and using
the recognizer lattices as input to the classification system.</p>
      <sec id="sec-4-1">
        <title>Acknowledgements</title>
        <p>This work is partially supported by the Spanish MICINN under contract TIN2011-28169-C05-01, Spain.</p>
        <p>[Pal13] Palacios, C.S., D'Haro, L.F., de Córdoba, R., Caraballo, M.A.: Incorporación de n-gramas
discriminativos para mejorar un reconocedor de idioma fonotáctico basado en i-vectores. Procesamiento del
Lenguaje Natural 51 (2013) 145–152</p>
        <p>[You06] Young, S.J., Evermann, G., Gales, M.J.F., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason,
D., Povey, D., Valtchev, V., Woodland, P.C.: The HTK Book, version 3.4. Cambridge University
Engineering Department, Cambridge, UK (2006)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Rod13]
          <string-name><surname>Rodríguez-Fuentes</surname>, <given-names>L.J.</given-names></string-name>,
          <string-name><surname>Brümmer</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Peñagarikano</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Varona</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Bordel</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Díez</surname>, <given-names>M.</given-names></string-name>:
          <article-title>The Albayzin 2012 language recognition evaluation</article-title>.
          In:
          <string-name><surname>Bimbot</surname>, <given-names>F.</given-names></string-name>,
          <string-name><surname>Cerisara</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Fougeron</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Gravier</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Lamel</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Pellegrino</surname>, <given-names>F.</given-names></string-name>,
          <string-name><surname>Perrier</surname>, <given-names>P.</given-names></string-name>, eds.:
          <source>Interspeech</source>, ISCA
          (<year>2013</year>)
          <fpage>1497</fpage>–<lpage>1501</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Sto02]
          <string-name><surname>Stolcke</surname>, <given-names>A.</given-names></string-name>:
          <article-title>SRILM - an extensible language modeling toolkit</article-title>.
          <source>In: Proc. of Intl. Conf. on Spoken Language</source>.
          (<year>2002</year>)
          <fpage>901</fpage>–<lpage>904</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>