
Language identification with limited resources

Emilio Sanchis

Mayte Gimenez

Departament de Sistemes Informàtics i Computació, Universitat Politècnica de València, València, Spain


Language identification is an important issue in many speech applications. We address this problem as the classification of phoneme sequences, under the assumption that each language has its own phonotactic characteristics. To perform this classification, we first decode the speech utterances into phonemes. The set of phonemes must be the same for all languages, so that the acoustic sequences have a comparable representation. We followed two different approaches using the same acoustic model: decoding the audio with a trigram model of phoneme sequences as language model, and decoding it with equiprobable unigrams of phonemes. A classification process based on perplexity is then performed.

Our language identification approach

Our approach to language identification (LI) is based on modeling the sequences of phonetic units that characterize each language we want to identify. The language identification process for a spoken utterance is divided into two phases:

Acoustic-Phonetic Decoding. The first phase of the LI process is a phonetic transcription of the spoken utterance whose language must be identified. In our proposal, this phase is the same for all languages and, therefore, it should be language independent.

Phonetic sequence classification. Once the spoken utterance has been phonetically transcribed, the resulting sequence must be classified in order to determine the language of the utterance. A language model of sequences of phonetic units is learned for each language. The selection criterion is based on minimizing the perplexity.

Let $L$ be the set of languages, $l_i \in L$ one of these languages, and $s$ the phonetic unit sequence to classify. The selected language $\hat{l}$ is the one that minimizes the expression

$$\hat{l} = \operatorname*{argmin}_{l_i \in L} \; 10^{-\frac{1}{|s|} \log p(s \mid l_i)} \qquad (1)$$

where $p(s \mid l_i)$ is the probability of the sequence $s$ assigned by the model representing language $l_i$.
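The decision rule of Equation (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the logarithm is taken in base 10, each language model is represented by a hypothetical function returning $\log_{10} p(s \mid l_i)$, and `uniform_model` is a toy stand-in rather than one of the trained models described in this paper.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical interface: a model maps a phonetic-unit sequence to log10 p(s | l_i).
LogProbFn = Callable[[Sequence[str]], float]


def perplexity(log10_prob: float, length: int) -> float:
    """Perplexity of a sequence: 10 ** (-(1/|s|) * log10 p(s | l_i))."""
    if length <= 0:
        raise ValueError("sequence must contain at least one phonetic unit")
    return 10.0 ** (-log10_prob / length)


def identify_language(s: Sequence[str], models: Dict[str, LogProbFn]) -> str:
    """Return the language whose model assigns s the lowest perplexity."""
    return min(models, key=lambda lang: perplexity(models[lang](s), len(s)))


def uniform_model(vocab_size: int) -> LogProbFn:
    """Toy stand-in for a trained model: every unit has probability 1/vocab_size."""
    return lambda s: -len(s) * math.log10(vocab_size)


if __name__ == "__main__":
    # Toy models only; real models would be n-gram models of phonetic units.
    toy_models = {"es": uniform_model(10), "en": uniform_model(50)}
    print(identify_language(["k+a", "k-a+s", "a-s+a"], toy_models))
```

Because the exponent is divided by $|s|$, the score is normalized by the length of the sequence, so the same rule applies to utterances of any length.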

Resources and Experimentation

This section describes the resources used, how we learned the language models, and the preliminary experimentation carried out in this work.

Description of the used corpus

We have used a corpus of 3446 spoken sentences to learn the language models and evaluate our proposal. The sentences were uttered by several native English, French, and Spanish speakers. The distribution of the languages in the corpus was somewhat unbalanced (1338 sentences in English, 708 in French, and 1400 in Spanish). The English and French sentences were queries to an information service about timetables and prices of long-distance trains. The Spanish sentences were extracted from an unrestricted, phonetically balanced corpus.

Learning the models

As the phonetic unit, we have chosen context-dependent phonemes. Specifically, we have used triphones, i.e., phonemes with information about the phonemes that appear to their left and right. We have learned the acoustic models for the triphones and the models of sequences of triphones using an independent Spanish corpus. Only Spanish triphones have been considered in this work, and the same set of Spanish triphones has been used for all the experimentation.
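To make the notion of a triphone concrete, the following minimal sketch expands a phoneme sequence into context-dependent units. It assumes the HTK-style `left-centre+right` label format and assumes that units at the edges of the sequence keep only the context that exists; the paper does not state how boundaries were handled, and the function name `to_triphones` is hypothetical.

```python
from typing import List, Sequence


def to_triphones(phonemes: Sequence[str]) -> List[str]:
    """Expand a phoneme sequence into context-dependent units.

    Labels follow the 'left-centre+right' pattern; the first and last
    units only carry the one context that is available (an assumption).
    """
    units = []
    for i, p in enumerate(phonemes):
        left = f"{phonemes[i - 1]}-" if i > 0 else ""
        right = f"+{phonemes[i + 1]}" if i < len(phonemes) - 1 else ""
        units.append(f"{left}{p}{right}")
    return units


if __name__ == "__main__":
    # Spanish "casa" as the phoneme sequence k a s a
    print(to_triphones(["k", "a", "s", "a"]))
    # ['k+a', 'k-a+s', 'a-s+a', 's-a']
```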

We have phonetically transcribed all the sentences in the corpus using two different Acoustic-Phonetic Decoding (APD) modules. In both modules, the set of triphones and the acoustic models associated with them were the same; the difference was the model of sequences of triphones used as language model. The first APD module used a trigram model of sequences of triphones. To avoid the bias of using, for all languages, a trigram model of sequences of phonetic units (triphones) learned from a Spanish corpus, a second module was built using an equiprobable unigram model of triphones. This way, all sequences of phonetic units of a given length have the same a priori probability. As a result, we obtained six sets of phonetically transcribed utterances, two for each language considered, one per APD module.
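The effect of the equiprobable unigram model can be checked with a short sketch. Under that model every unit has probability 1/V, where V is the size of the triphone inventory, so the prior of a sequence depends only on its length and not on its content. The inventory size used below is a hypothetical value chosen for illustration, not the size of the inventory used in this work.

```python
import math
from typing import Sequence


def equiprobable_log10_prior(sequence: Sequence[str], inventory_size: int) -> float:
    """log10 prior of a sequence when every unit has probability 1/inventory_size."""
    if inventory_size < 1:
        raise ValueError("inventory_size must be positive")
    return -len(sequence) * math.log10(inventory_size)


if __name__ == "__main__":
    V = 1000  # hypothetical inventory size
    a = ["k+a", "k-a+s", "a-s+a", "s-a"]
    b = ["s+a", "s-a+k", "a-k+a", "k-a"]
    # Same length, different content: identical prior, so the decoder
    # output is driven by the acoustic models alone.
    print(equiprobable_log10_prior(a, V) == equiprobable_log10_prior(b, V))
```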

Experimentation

To evaluate our approach, we split the available corpus by language and used 80% for training the classification models, leaving the remaining 20% to evaluate the performance of the system. Since we have two different APD modules (trigrams and equiprobable unigrams), we were able to learn two sets of language models. For each set, we learned a trigram language model for every language we are trying to discriminate.
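A per-language 80/20 split of this kind could look like the sketch below. The shuffling, the fixed seed, and the truncation used to compute the cut point are assumptions made for the example; the paper does not specify how the partition was drawn.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_by_language(
    corpus: Dict[str, Sequence[str]],
    train_fraction: float = 0.8,
    seed: int = 0,
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Split each language's utterances into (train, test) independently."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {}
    for lang, utterances in corpus.items():
        shuffled = list(utterances)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)  # truncates toward zero
        splits[lang] = (shuffled[:cut], shuffled[cut:])
    return splits


if __name__ == "__main__":
    # Placeholder utterance identifiers with the corpus sizes reported above.
    sizes = {"en": 1338, "fr": 708, "es": 1400}
    corpus = {lang: [f"{lang}_{i}" for i in range(n)] for lang, n in sizes.items()}
    for lang, (train, test) in split_by_language(corpus).items():
        print(lang, len(train), len(test))
```

Splitting each language separately keeps the 80/20 proportion within every language, which matters here because the corpus is unbalanced.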

We used the SRILM Toolkit [Sto02] to estimate the phonetic language models of the classifiers and the HTK Speech Recognition Toolkit [You06] to perform the phonetic transcriptions.
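For readers unfamiliar with SRILM, the sketch below assembles the command lines for its `ngram-count` and `ngram` tools, which estimate an n-gram model from a text file and report the perplexity of a text file under a model. The exact options used in this work are not stated, so only the basic flags are shown; the file names and the helper function names are hypothetical.

```python
from typing import List


def srilm_train_cmd(train_text: str, lm_path: str, order: int = 3) -> List[str]:
    """argv for SRILM's ngram-count: estimate an n-gram model from a text file."""
    return ["ngram-count", "-order", str(order), "-text", train_text, "-lm", lm_path]


def srilm_ppl_cmd(lm_path: str, test_text: str, order: int = 3) -> List[str]:
    """argv for SRILM's ngram: report the perplexity of a text file under a model."""
    return ["ngram", "-order", str(order), "-lm", lm_path, "-ppl", test_text]


if __name__ == "__main__":
    # Hypothetical file names: one transcription file per language and APD module.
    print(" ".join(srilm_train_cmd("train.es.trigram_apd.txt", "es.trigram_apd.lm")))
    print(" ".join(srilm_ppl_cmd("es.trigram_apd.lm", "test.fr.trigram_apd.txt")))
```

With SRILM installed, each list can be passed to `subprocess.run(cmd, check=True)`; each line of the input files is expected to hold one transcribed utterance as a space-separated sequence of phonetic units.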

Two different experiments were conducted. The first experiment consisted of measuring the perplexity of the test sets. Table 1 shows the perplexity for all training and test combinations. Each column corresponds to the test set of a different language, transcribed with a specific APD module (Trigrams APD for the APD based on trigrams of phonetic units and Equiprobable APD for the APD based on equiprobable unigrams of phonetic units). Each row corresponds to a classifier learned from the transcriptions of the training sentences of a specific language, obtained with a specific APD module.

As expected, Table 1 shows a lower perplexity for combinations where the language of the classifier and the language of the test set are the same. Regarding the APD module, lower perplexity occurs when an APD based on trigrams is used to transcribe the sentences, especially those in the test set. It seems that using trigrams of phonetic units learned from a Spanish-only corpus is not as critical as we expected a priori.

[Table 1: perplexity for all training and test combinations; test-set columns grouped under Trigrams APD and Equiprobable APD.]

A second experiment was conducted to evaluate the performance of the language identification system. The global accuracy of the system was 0.841 when the Trigram APD module was used and 0.775 when the Equiprobable APD module was used. As in the case of perplexity, the best accuracy is obtained using the Trigram APD module. Table 2 shows the accuracy for each of the languages involved. The best results are obtained for Spanish, possibly because the triphones used were only those of Spanish. Although the phonetic similarity between Spanish and French seems greater than that between Spanish and English, the results for English are better than those obtained for French. This may be due to the larger number of English sentences available for the experimentation.
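The relation between the global accuracy and the per-language accuracies can be sketched as follows. The global figure is the fraction of all test utterances classified correctly, so languages with more test sentences weigh more in it. The function name and the toy label pairs below are hypothetical and are not the results of this work.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def accuracies(pairs: Iterable[Tuple[str, str]]) -> Tuple[float, Dict[str, float]]:
    """Return (global accuracy, per-language accuracy) from (true, predicted) pairs."""
    total = Counter()
    correct = Counter()
    for true_lang, predicted in pairs:
        total[true_lang] += 1
        if predicted == true_lang:
            correct[true_lang] += 1
    n = sum(total.values())
    if n == 0:
        raise ValueError("no test utterances given")
    overall = sum(correct.values()) / n
    per_language = {lang: correct[lang] / total[lang] for lang in total}
    return overall, per_language


if __name__ == "__main__":
    # Toy (true, predicted) pairs for illustration only.
    toy = [("es", "es"), ("es", "es"), ("en", "en"),
           ("en", "fr"), ("fr", "es"), ("fr", "fr")]
    print(accuracies(toy))
```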

Conclusions and future work

In this paper we have presented a preliminary approach to the language identification problem. Our proposal is based on the classification of sequences of phonemes, assuming that each language has its own phonotactic characteristics. The experimentation shows that our approach is able to predict the language of the speaker reasonably well, especially considering the limited resources used. We have several ideas for improving the performance of our system, including using truly language-independent phonetic units and using the recognizer lattices as input to the classification system.

Acknowledgements

This work is partially supported by the Spanish MICINN under contract TIN2011-28169-C05-01.

References

[Pal13] Palacios, C.S., D'Haro, L.F., de Córdoba, R., Caraballo, M.A.: Incorporación de n-gramas discriminativos para mejorar un reconocedor de idioma fonotáctico basado en i-vectores. Procesamiento del Lenguaje Natural 51 (2013) 145–152

[Rod13] Rodríguez-Fuentes, L.J., Brümmer, N., Peñagarikano, M., Varona, A., Bordel, G., Díez, M.: The Albayzin 2012 language recognition evaluation. In Bimbot, F., Cerisara, C., Fougeron, C., Gravier, G., Lamel, L., Pellegrino, F., Perrier, P., eds.: Interspeech, ISCA (2013) 1497–1501

[Sto02] Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proc. of Intl. Conf. on Spoken Language Processing. (2002) 901–904

[You06] Young, S.J., Evermann, G., Gales, M.J.F., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P.C.: The HTK Book, version 3.4. Cambridge University Engineering Department, Cambridge, UK (2006)