Impact of Using a Bilingual Model on Kazakh–Russian Code-Switching Speech Recognition

Dmitrii Ubskii (1,2,3) [0000-0003-1760-6837], Yuri Matveev (2) [0000-0001-7010-1585], and Wolfgang Minker (3) [0000-0003-4531-0662]

1 STC-innovations Ltd, St. Petersburg, Russia — ubskiy@speechpro.com
2 ITMO University, St. Petersburg, Russia — matveev@mail.ifmo.ru
3 Ulm University, Ulm, Germany — wolfgang.minker@uni-ulm.de

Abstract. Due to the prevalence of bilingualism among Kazakh speakers, code-switching to Russian is common in Kazakh speech. This presents a challenge for monolingual Kazakh-language ASR systems, which struggle to transcribe the embedded Russian words. This paper attempts to determine the benefit of bilingual training on matrix-language (Kazakh) and embedded-language (Russian) monolingual data, as opposed to training on code-switched data only. Specifically, we evaluate the model's performance on matrix-language words and embedded words separately. We make use of two datasets: Kazakh speech with code-switching and Russian speech with no code-switching. We train a monolingual model on each dataset, and a bilingual model on a mixture of the two. The main objective of the experiments is to compare the performance of a model trained on code-switched speech with that of a model trained on full utterances in both languages. Experimental results suggest that bilingual training improves the model's performance on matrix words, and greatly improves its performance on embedded words. We observe an absolute WER improvement of 14.69% on the code-switched words.

Keywords: speech recognition · code-switching · Kazakh language

1 Introduction

A previous attempt at building a bilingual Kazakh–Russian speech recognition system by Khomitsevich et al. [1] uncovered two main challenges: a lack of Kazakh language resources, and large amounts of code-switching to and borrowing from Russian, a phonotactically very different language.
_____________
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Code-switching (also referred to as code-mixing [2]) is the practice of alternating languages within an utterance; it is common in bilingual and multilingual communities. The dominant language in code-switched speech is often referred to as the matrix language, while the language whose elements are inserted into the dominant one is referred to as the embedded language [3]. Since code-switching mostly occurs in informal conversations [4], the difficulty of recognizing code-switched speech is compounded by the difficulty of conversational speech recognition.

Although most state-of-the-art ASR systems are monolingual, the impact of code-switching on ASR performance has recently sparked research interest [5–9]. The success so far has, however, been limited, largely due to the challenges outlined above.

Because the majority of Kazakh speakers are bilingual [10], code-switching occurs commonly in Kazakh conversations. It is therefore important that any automatic speech recognition system deployed for the Kazakh language is able to handle code-switching. In this paper we attempt to determine the impact of training on both Kazakh and Russian language data on the quality of speech recognition of the embedded Russian segments in Kazakh speech.

The rest of the paper is organized as follows: Section 2 describes the dataset used in this work. Section 3 describes the model architecture and reports the experimental results. Finally, Section 4 concludes the paper and discusses the results.

2 Data

We make use of a proprietary Russian–Kazakh dataset consisting of Kazakh call centre operator recordings. No data augmentation techniques were used in the course of this work. Data statistics by language and subset are presented in Table 1.

Table 1: Data breakdown.
            Subset duration, hrs
Language    training    evaluation
Kazakh      97.4        1.2
Russian     58.8        1.0

The domain of the data is very narrow, containing a significant amount of stock phrases and domain-specific words. Approximately 10% of the words in the Kazakh-language data are code-switched. The observed cases of code-switching include intra-sentential code-switching (insertion of Russian phrases into otherwise Kazakh sentences), as well as intra-word switching (Russian words conjugated as if they were Kazakh) [11]. Conversely, the amount of code-switching in the Russian-language data is negligible.

3 Experiment

For training we use 40-dimensional log Mel-scale filter bank energy features with CMN and first- and second-order derivatives. All the ASR systems are built using the Kaldi speech recognition toolkit [12]. For each set of data (code-switched Kazakh, Russian, and the combined training set) we train a Deep Neural Network–Hidden Markov Model (DNN-HMM) acoustic model [13]. The experiments are carried out using the nnet3 setup of the Kaldi toolkit.

For language modeling, all transcripts available for each set of data are merged and used to train a 3-gram language model. We use graphemic pronunciation dictionaries when compiling the language model into a WFST decoder.

Acoustic models based on deep Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks have been demonstrated to be highly effective in various ASR tasks [14–16]. We use an identical BLSTM architecture for the acoustic model on each set of data. Each model has three hidden BLSTM layers with projections [17]. The dimension of each cell is 512, and the dimensions of the recurrent and non-recurrent projections are set to 256. The output layer consists of 6240 units (see Fig. 1).

[Fig. 1: Full BLSTM architecture. (a) Single BLSTM layer; (b) acoustic model.]
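Kaldi's nnet3 provides these projected-LSTM (LSTMP) layers out of the box. Purely as an illustration of the projection mechanism of Sak et al. [17] — not the Kaldi implementation, and with random placeholder weights rather than trained values — a single time step of one direction of one layer can be sketched in numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Dimensions from Section 3: 512 cell units, recurrent and
# non-recurrent projections of 256 units each, 40-dim fbank input.
INPUT_DIM, CELL_DIM, PROJ_DIM = 40, 512, 256

# Randomly initialised parameters (shape illustration only).
W = rng.standard_normal((4 * CELL_DIM, INPUT_DIM)) * 0.01   # input weights
R = rng.standard_normal((4 * CELL_DIM, PROJ_DIM)) * 0.01    # recurrent weights
b = np.zeros(4 * CELL_DIM)                                  # gate biases
P = rng.standard_normal((2 * PROJ_DIM, CELL_DIM)) * 0.01    # joint projection

def lstmp_step(x, r, c):
    """One step of an LSTM with recurrent and non-recurrent projections:
    the gates see the 256-dim projected state r, not the full 512-dim cell."""
    gates = W @ x + R @ r + b
    i, f, g, o = np.split(gates, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    m = sigmoid(o) * np.tanh(c_new)            # 512-dim cell output
    rp = P @ m                                 # project down: [recurrent | non-recurrent]
    r_new, p = rp[:PROJ_DIM], rp[PROJ_DIM:]    # only r_new feeds back in time
    y = np.concatenate([r_new, p])             # 512-dim layer output per direction
    return y, r_new, c_new

# One frame through one direction of one layer, from zero state.
x = rng.standard_normal(INPUT_DIM)
y, r, c = lstmp_step(x, np.zeros(PROJ_DIM), np.zeros(CELL_DIM))
print(y.shape, r.shape, c.shape)  # (512,) (256,) (512,)
```

A bidirectional layer runs this recurrence over the utterance in both time directions and concatenates the two outputs; stacking three such layers under the 6240-unit output layer gives the architecture of Fig. 1.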
Each acoustic model is then trained using Natural Gradient for Stochastic Gradient Descent [18] and evaluated on the corresponding evaluation set. Evaluation results are presented in Table 2.

Table 2: WER results for monolingual and bilingual models on the Kazakh and Russian evaluation sets.

                     WER, %
                     kaz      rus
Monolingual (kaz)    52.38    —
Monolingual (rus)    —        31.91
Bilingual            49.42    37.42
Improvement           2.96    –5.51

As seen in Table 2, the bilingual model performs better on the Kazakh evaluation set, at the expense of a significant performance loss on the Russian evaluation set. As the Russian-language evaluation set contains no code-switched sentences, it and the Russian monolingual model are not considered further. Instead, we focus on the Kazakh evaluation set for closer examination.

To determine the impact of bilingual training on code-switching, we collected the per-word statistics used in the WER calculation (Table 3). Each error, whether a substitution (S), insertion (I), or deletion (D), is classified based on the language the word belongs to. Note that substitutions are thus split into two classes: substitution with a Kazakh word or with a Russian word.

Table 3: Per-word statistics by model and language.

Model              Target word   Correct, #           Errors, #
                   language                   S (kaz)  S (rus)    I     D
Monolingual (kaz)  Kazakh        3573         2173     103        490   529
                   Russian        303          235      47         44    55
Bilingual          Kazakh        3508         2106     135        311   629
                   Russian        377          189      30         24    44

We then calculate WER for matrix- and embedded-language words separately (Table 4). For the purposes of this calculation, all substitutions are considered to belong to the language of the token in the reference transcription. The results in Table 4 show a clear improvement in the recognition of embedded-language words.

Table 4: WER by model and language of the word in the reference transcription.
                     WER, %
                     kaz      rus
Monolingual (kaz)    51.66    59.53
Bilingual            49.87    44.84
Improvement           1.79    14.69

4 Conclusions

In this paper we presented a bilingual Kazakh–Russian speech recognition system. We observe a significant WER improvement on the matrix (Kazakh-language) data, and a 14.69% absolute WER improvement on the embedded (Russian-language) data. It is worth noting that this is not a case of more data trivially yielding better results: the bilingual model performs significantly worse on the Russian-language data alone.

The results indicate that multilingual speech recognition systems are inherently better at recognizing code-switched speech than monolingual systems trained on code-switched speech itself. Future directions include investigating end-to-end multilingual systems from the point of view of code-switched segments, developing more sophisticated language modeling for code-switching, and introducing more than two languages.

5 Acknowledgments

This work was partially financially supported by the Government of the Russian Federation (Grant 08-08) and by a grant of the Ministry of Education and Science of the Russian Federation (Goszadanie No. 2.13462.2019/13.2).

References

1. Khomitsevich, O., Mendelev, V., Tomashenko, N., et al.: A Bilingual Kazakh–Russian System for Automatic Speech Recognition and Synthesis. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) Speech and Computer. SPECOM 2015. Lecture Notes in Computer Science, vol. 9319. Springer, Cham (2015)
2. Muysken, P.: Bilingual Speech: A Typology of Code-Mixing. Cambridge University Press (2000)
3. Myers-Scotton, C.: Duelling Languages: Grammatical Structure in Codeswitching. Clarendon Press, Oxford (1993)
4. Sitaram, S., Chandu, K.R., Rallabandi, S.K., Black, A.W.: A Survey of Code-switched Speech and Language Processing. arXiv preprint arXiv:1904.00784 (2019)
5.
   Vu, N.T., Lyu, D., Weiner, J., Telaar, D., Schlippe, T., Blaicher, F., Siong, C.E., Schultz, T., Li, H.: A first speech recognition system for Mandarin–English code-switch conversational speech. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4889–4892 (2012)
6. Modipa, T., Davel, M.H., de Wet, F.: Implications of Sepedi/English code switching for ASR systems. In: Proceedings of the 24th Annual Symposium of the Pattern Recognition Association of South Africa, Johannesburg, South Africa (2013)
7. Lyudovyk, T., Pylypenko, V.: Code-switching speech recognition for closely related languages. In: SLTU (2014)
8. Yilmaz, E., van den Heuvel, H., van Leeuwen, D.A.: Investigating Bilingual Deep Neural Networks for Automatic Recognition of Code-switching Frisian Speech. In: SLTU (2016)
9. Biswas, A., de Wet, F., van der Westhuizen, E., Yilmaz, E., Niesler, T.R.: Multilingual Neural Network Acoustic Modelling for ASR of Under-Resourced English–isiZulu Code-Switched Speech. In: INTERSPEECH (2018)
10. Pavlenko, A.: Russian in post-Soviet countries. Russian Linguistics 32(1), 59–80 (2008)
11. Myers-Scotton, C.: Codeswitching with English: types of switching, types of communities. World Englishes 8, 333–346 (1989)
12. Povey, D., et al.: The Kaldi Speech Recognition Toolkit. In: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 1–4. Big Island (2011)
13. Hinton, G., Deng, L., Yu, D., et al.: Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29(6), 82–97 (2012)
14. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
15. Graves, A., Jaitly, N., Mohamed, A.: Hybrid speech recognition with deep bidirectional LSTM. In: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 273–278. Olomouc (2013)
16.
    Mohamed, A., Seide, F., Yu, D., Droppo, J., Stolcke, A., Zweig, G., Penn, G.: Deep Bi-directional Recurrent Networks Over Spectral Windows. In: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 55–59. Scottsdale (2015)
17. Sak, H., Senior, A., Beaufays, F.: Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128 (2014)
18. Povey, D., Zhang, X., Khudanpur, S.: Parallel Training of DNNs with Natural Gradient and Parameter Averaging. arXiv preprint arXiv:1410.7455 (2014)
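As a closing sanity check, the per-language WER figures in Table 4 follow directly from the per-word counts in Table 3: for each language, the reference word count is correct words plus substitutions plus deletions, and WER = (S + I + D) / N. A minimal sketch of this calculation (the helper function name is illustrative; the counts are transcribed from Table 3):

```python
def per_language_wer(correct, s_kaz, s_rus, ins, dele):
    """WER restricted to reference words of one language.

    N = correct + all substitutions + deletions;
    insertions add errors but no reference words.
    """
    n_ref = correct + s_kaz + s_rus + dele
    errors = s_kaz + s_rus + ins + dele
    return 100.0 * errors / n_ref

# Counts from Table 3: (correct, S_kaz, S_rus, I, D)
mono_kaz = per_language_wer(3573, 2173, 103, 490, 529)  # Kazakh words, monolingual
mono_rus = per_language_wer(303, 235, 47, 44, 55)       # Russian words, monolingual
bi_kaz   = per_language_wer(3508, 2106, 135, 311, 629)  # Kazakh words, bilingual
bi_rus   = per_language_wer(377, 189, 30, 24, 44)       # Russian words, bilingual

print(round(mono_kaz, 2), round(mono_rus, 2))  # Table 4, monolingual row
print(round(bi_kaz, 2), round(bi_rus, 2))      # Table 4, bilingual row
print(round(mono_rus - bi_rus, 2))             # absolute improvement on embedded words
```

Running this reproduces 51.66/59.53 and 49.87/44.84, and the 14.69% absolute improvement on embedded (Russian) words quoted in the abstract.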