<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Impact of Using a Bilingual Model on Kazakh–Russian Code-Switching Speech Recognition</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff2">
          <institution>STC-innovations Ltd</institution>
          ,
          <addr-line>St. Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
          <email>ubskiy@speechpro.com</email>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>ITMO University</institution>
          ,
          <addr-line>St. Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ulm University</institution>
          ,
          <addr-line>Ulm</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Due to the prevalence of bilingualism among Kazakh speakers, code-switching to Russian is common in Kazakh speech. This presents a challenge for monolingual Kazakh-language ASR systems, which struggle to transcribe the embedded Russian words. This paper attempts to determine the benefit of bilingual training on matrix-language (Kazakh) and embedded-language (Russian) monolingual data, as opposed to training on code-switched data only. Specifically, we evaluate the model's performance on matrix-language words and embedded words separately. We make use of two datasets: Kazakh speech with code-switching and Russian speech with no code-switching. We train a monolingual model on each dataset, and a bilingual model on a mixture of the two. The main objective of the experiments is to compare the performance of a model trained on code-switched speech with that of a model trained on full utterances in both languages. Experimental results suggest that bilingual training improves the model's performance on matrix words, and greatly improves its performance on embedded words. We observe an absolute WER improvement of 14.69% on the code-switched words.</p>
      </abstract>
      <kwd-group>
        <kwd>speech recognition</kwd>
        <kwd>code-switching</kwd>
        <kwd>Kazakh language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A previous attempt at building a bilingual Kazakh–Russian speech recognition
system by Khomitsevich et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] uncovered two main challenges: a lack of Kazakh-language
resources and large amounts of code-switching to, and borrowing from,
Russian, a phonotactically very different language.
      </p>
      <p>
        Code-switching (also referred to as code-mixing [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) is a practice of
alternating languages within an utterance that is common in bilingual and multilingual
communities. The dominant language in code-switched speech is often referred
to as the matrix language, while the language whose elements are inserted into
the dominant one is referred to as the embedded language [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Since it mostly
occurs in informal conversations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the difficulty of recognizing code-switched
speech is compounded by the difficulty of conversational speech recognition.
      </p>
      <p>Although most state-of-the-art ASR systems are monolingual, the impact of
code-switching on ASR performance has recently sparked research interest [5–9].
The success so far has, however, been limited, largely due to the challenges
outlined above.</p>
      <p>
        Due to the majority of Kazakh speakers being bilingual [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], code-switching
commonly occurs in Kazakh conversations. It is therefore important that
any automatic speech recognition system deployed for the Kazakh language be able
to handle code-switching.
      </p>
      <p>In this paper we attempt to determine the impact of training on both Kazakh
and Russian language data on the quality of speech recognition of the embedded
Russian segments in Kazakh speech.</p>
      <p>The rest of the paper is organized as follows: Section 2 describes the dataset
used in this work. Section 3 describes the model architecture and reports the
experimental results. Finally, Section 4 concludes the paper and discusses the
results.
</p>
    </sec>
    <sec id="sec-2">
      <title>Data</title>
      <p>
        We make use of a proprietary Russian–Kazakh dataset consisting of Kazakh
call centre operator recordings. No data augmentation techniques were used in
the course of this work. Data statistics by language and subset are presented in
Table 1.
The domain of the data is very narrow, containing a significant number of stock
phrases and domain-specific words. Approximately 10% of the words in the
Kazakh-language data are code-switched. The observed cases of code-switching
include intra-sentential code-switching (insertion of Russian phrases into otherwise
Kazakh sentences), as well as intra-word switching (Russian words conjugated
as if they were Kazakh) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Conversely, the amount of code-switching in the
Russian language data is negligible.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Experiment</title>
      <p>For training we use 40-dimensional log Mel-scale filter bank energy features with
cepstral mean normalization (CMN) and first- and second-order derivatives.</p>
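The derivative features can be illustrated in isolation. Below is a minimal pure-Python sketch of the standard regression-based delta computation over a window of ±2 frames; it is an illustration of the general technique, not the exact implementation used in this work:

```python
def deltas(frames, window=2):
    """Append first-order derivatives to a sequence of feature vectors.

    Uses the standard regression formula
    d_t = sum_k k * (x[t+k] - x[t-k]) / (2 * sum_k k^2),
    with edge frames clamped. `frames` is a list of equal-length lists.
    """
    n, dim = len(frames), len(frames[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        d = [0.0] * dim
        for k in range(1, window + 1):
            prev = frames[max(t - k, 0)]       # clamp at sequence start
            nxt = frames[min(t + k, n - 1)]    # clamp at sequence end
            for i in range(dim):
                d[i] += k * (nxt[i] - prev[i]) / denom
        out.append(frames[t] + d)
    return out
```

Applying the same operator again to the delta block yields the second-order derivatives, giving 120-dimensional static+Δ+ΔΔ vectors from the 40-dimensional filter bank features.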
      <p>
        All the ASR systems are built using the Kaldi speech recognition toolkit [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
For each set of data (code-switched Kazakh, Russian, and the combined training set)
we train a Deep Neural Network Hidden Markov Model (DNN-HMM) acoustic
model [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The experiments are carried out using the nnet3 setup of the Kaldi
toolkit.
      </p>
      <p>Fig. 1. (a) Single BLSTM layer; (b) acoustic model.</p>
      <p>For language modeling, all transcripts available for each set of data are
merged and used to train a 3-gram language model. We use graphemic
pronunciation dictionaries when compiling the language model into a WFST decoder graph.</p>
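The counting step behind 3-gram estimation can be sketched as follows. This is a toy maximum-likelihood version; a real recipe would add smoothing and backoff, which this deliberately omits:

```python
from collections import Counter

def train_trigram(sentences):
    """MLE trigram model from tokenized sentences, with <s>/</s> padding.

    Returns a function p(a, b, c) = count(a b c) / count(a b).
    """
    tri, bi = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for a, b, c in zip(toks, toks[1:], toks[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1
    return lambda a, b, c: tri[(a, b, c)] / bi[(a, b)] if bi[(a, b)] else 0.0
```

A graphemic pronunciation dictionary, in turn, simply maps each word to its letter sequence, sidestepping the need for hand-built phonetic lexicons in both languages.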
      <p>
        Acoustic models based on deep Bidirectional Long Short-Term Memory (BLSTM)
recurrent neural networks have been demonstrated to be highly effective in
various ASR tasks [14–16]. We use an identical BLSTM architecture for the acoustic
model for each set of data. Each has three hidden BLSTM layers with
projections [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The dimension of each cell is 512, and the dimensions of the recurrent
and non-recurrent projections are set to 256. The output layer consists of 6240
units (see Fig. 1).
      </p>
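As a rough sanity check on the model size, the per-layer parameter count of an LSTM with recurrent and non-recurrent projections can be estimated as below. This assumes the standard LSTMP formulation of [17]; exact counts in a given toolkit may differ slightly:

```python
def lstmp_params(input_dim, cell_dim, rec_proj, nonrec_proj):
    """Approximate parameter count of one LSTM-with-projection direction.

    The four gate weight matrices see the input plus the recurrent
    projection; the projection matrix maps the cell output down to
    rec_proj + nonrec_proj dimensions.
    """
    gates = 4 * cell_dim * (input_dim + rec_proj)   # W for i, f, o, g gates
    biases = 4 * cell_dim
    projection = (rec_proj + nonrec_proj) * cell_dim
    return gates + biases + projection

# First bidirectional layer on 120-dim features (two directions):
first_layer = 2 * lstmp_params(120, 512, 256, 256)
```

Higher layers take the 512-dimensional concatenated projections of both directions as input, so their counts follow from the same formula with `input_dim=512`.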
      <p>
        Each acoustic model is then trained using Natural Gradient for Stochastic
Gradient Descent [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and evaluated on the corresponding evaluation sets.
Evaluation results are presented in Table 2.
      </p>
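The parameter-averaging idea underlying [18] can be illustrated on a toy one-parameter problem. This is a deliberately simplified sketch: the actual method also rescales gradients with an online estimate of the Fisher matrix (the "natural gradient" part), which this omits:

```python
def sgd_on_shard(w, shard, lr=0.05, epochs=40):
    """Plain SGD minimizing sum over x of (w - x)^2 on one data shard."""
    for _ in range(epochs):
        for x in shard:
            w -= lr * 2 * (w - x)  # gradient of (w - x)^2
    return w

def train_parallel(w0, shards, rounds=10):
    """Each round: every 'worker' trains on its own shard from the
    current model, then the workers' parameters are averaged."""
    w = w0
    for _ in range(rounds):
        w = sum(sgd_on_shard(w, s) for s in shards) / len(shards)
    return w
```

Even though each worker drifts toward its own shard, the periodic averaging keeps the combined model near the optimum of the pooled data.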
      <p>As seen in Table 2, the bilingual model performs better on the
Kazakh evaluation set, at the expense of a significant performance loss on the
Russian evaluation set.</p>
      <p>As the Russian language evaluation set contains no code-switched sentences,
it and the Russian monolingual model are not considered further. Instead, we
focus on the Kazakh evaluation set for closer examination.</p>
      <p>To determine the impact of bilingual training on code-switching, we
collected the per-word statistics used in WER calculation (Table 3). Each error,
whether a substitution (S), insertion (I), or deletion (D), is classified according to the language
the word belongs to. Note that substitutions are thus split into two classes:
substitution with a Kazakh word or with a Russian word.</p>
      <p>We then calculate WER for matrix and embedded language words separately
(Table 4). For the purposes of this calculation, all substitutions are considered
to belong to the language of the reference token.</p>
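The per-language split can be computed from a standard Levenshtein alignment. Below is a minimal sketch assuming each reference and hypothesis token is already tagged with its language; the tagging scheme and attribution of insertions to the hypothesis token's language are our illustrative choices, not taken from the scoring setup described above:

```python
def align(ref, hyp):
    """Levenshtein alignment of two token sequences; returns edit ops."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # backtrace from the bottom-right corner
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append("ok" if ref[i - 1] == hyp[j - 1] else "sub")
            i -= 1
            j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append("del")
            i -= 1
        else:
            ops.append("ins")
            j -= 1
    return list(reversed(ops))

def per_language_wer(ref, hyp):
    """ref/hyp are lists of (word, lang) pairs; substitutions and
    deletions go to the reference token's language, insertions to the
    hypothesis token's language."""
    errs, totals = {}, {}
    i = j = 0
    for op in align([w for w, _ in ref], [w for w, _ in hyp]):
        if op in ("ok", "sub", "del"):
            lang = ref[i][1]
            totals[lang] = totals.get(lang, 0) + 1
        else:
            lang = hyp[j][1]
        if op != "ok":
            errs[lang] = errs.get(lang, 0) + 1
        if op in ("ok", "sub", "del"):
            i += 1
        if op in ("ok", "sub", "ins"):
            j += 1
    return {l: errs.get(l, 0) / totals[l] for l in totals}
```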
      <p>The results shown in Table 4 show a clear improvement in the recognition of
embedded-language words.</p>
      <p>In this paper we presented a bilingual Kazakh–Russian speech recognition
system. We observe a significant WER improvement on the matrix (Kazakh-language)
data, and a 14.69% absolute WER improvement on the embedded (Russian-language)
data. It is worth noting that this is not a case of more data trivially
yielding better results: the bilingual model performs significantly worse on the
Russian-language data alone.</p>
      <p>The results indicate that multilingual speech recognition systems are
inherently better at recognizing code-switched speech than monolingual
systems trained on code-switched speech itself. Future directions include
investigating end-to-end multilingual systems from the point of view of code-switched
segments, developing more sophisticated language modeling for code-switching,
and introducing more than two languages.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work was partially nancially supported by the Government of the Russian
Federation (Grant 08-08), and by the grant of Ministry of Education and Science
of the Russian Federation Goszadanie No. 2.13462.2019/13.2.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Khomitsevich</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendelev</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tomashenko</surname>
            <given-names>N.</given-names>
          </string-name>
          et al.:
          <article-title>A Bilingual Kazakh-Russian System for Automatic Speech Recognition and Synthesis</article-title>
          . In: Ronzhin A.,
          <string-name>
            <surname>Potapova</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fakotakis</surname>
            <given-names>N</given-names>
          </string-name>
          . (eds.)
          <source>Speech and Computer</source>
          .
          <source>SPECOM 2015. Lecture Notes in Computer Science</source>
          , vol
          <volume>9319</volume>
          . Springer, Cham (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Muysken</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Díaz</surname>
            ,
            <given-names>C. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muysken</surname>
            ,
            <given-names>P. C.</given-names>
          </string-name>
          :
          <article-title>Bilingual speech: A typology of code-mixing</article-title>
          (Vol.
          <volume>11</volume>
          ). Cambridge University Press. (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Myers-Scotton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Duelling Languages: Grammatical Structure in Codeswitching. Oxford: Clarendon Press,
          <volume>20</volume>
          (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Sitaram</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chandu</surname>
            ,
            <given-names>K. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rallabandi</surname>
            ,
            <given-names>S. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Black</surname>
            ,
            <given-names>A. W.:</given-names>
          </string-name>
          <article-title>A Survey of Codeswitched Speech and Language Processing</article-title>
          . arXiv preprint arXiv:1904.00784
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Vu</surname>
            ,
            <given-names>N.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lyu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Telaar</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schlippe</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blaicher</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siong</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schultz</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>A first speech recognition system for Mandarin-English codeswitch conversational speech</article-title>
          .
          <source>In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , pp.
          <fpage>4889</fpage>
          –
          <lpage>4892</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Modipa</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davel</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wet</surname>
            ,
            <given-names>F.D.</given-names>
          </string-name>
          :
          <article-title>Implications of Sepedi/English code switching for ASR systems</article-title>
          .
          <source>In: Conference Proceedings of the 24th Annual Symposium of the Pattern Recognition Association of South Africa</source>
          , Johannesburg, South Africa (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lyudovyk</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pylypenko</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Code-Switching speech recognition for closely related languages</article-title>
          .
          <source>SLTU</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Yilmaz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heuvel</surname>
            ,
            <given-names>H.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leeuwen</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          :
          <article-title>Investigating Bilingual Deep Neural Networks for Automatic Recognition of Code-switching Frisian Speech</article-title>
          .
          <source>SLTU</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Biswas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wet</surname>
            ,
            <given-names>F.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Westhuizen</surname>
            ,
            <given-names>E.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yilmaz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niesler</surname>
            ,
            <given-names>T.R.</given-names>
          </string-name>
          :
          <article-title>Multilingual Neural Network Acoustic Modelling for ASR of Under-Resourced English-isiZulu Code-Switched Speech</article-title>
          .
          <source>INTERSPEECH</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pavlenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Russian in post-Soviet countries</article-title>
          .
          <source>Russian Linguistics</source>
          <volume>32</volume>
          (
          <issue>1</issue>
          ),
          <fpage>59</fpage>
          –
          <lpage>80</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Myers-Scotton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Codeswitching with English: types of switching, types of communities</article-title>
          .
          <source>World Englishes</source>
          <volume>8</volume>
          , pp.
          <fpage>333</fpage>
          –
          <lpage>346</lpage>
          (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Povey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          et al.:
          <article-title>The Kaldi Speech Recognition Toolkit</article-title>
          .
          <source>In: IEEE workshop on Automatic Speech Recognition and Understanding (ASRU)</source>
          , pp.
          <fpage>1</fpage>
          –
          <lpage>4</lpage>
          .
          Big Island
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          et al.:
          <article-title>Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups</article-title>
          .
          <source>Signal Processing Magazine</source>
          , IEEE,
          <volume>29</volume>
          (
          <issue>6</issue>
          ),
          <fpage>82</fpage>
          –
          <lpage>97</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation 9(8)</source>
          ,
          <fpage>1735</fpage>
          –
          <lpage>1780</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaitly</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Hybrid speech recognition with deep bidirectional LSTM</article-title>
          .
          <source>In: IEEE workshop on Automatic Speech Recognition and Understanding (ASRU)</source>
          , pp.
          <fpage>55</fpage>
          –
          <lpage>59</lpage>
          .
          Scottsdale
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seide</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Droppo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stolcke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zweig</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Penn</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Deep Bi-directional Recurrent Networks Over Spectral Windows</article-title>
          . In:
          <source>IEEE Workshop on Automatic Speech Recognition and Understanding</source>
          , pp.
          <fpage>273</fpage>
          –
          <lpage>278</lpage>
          .
          Olomouc
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Sak</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senior</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beaufays</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition</article-title>
          .
          <source>arXiv preprint arXiv:1402.1128</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Povey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khudanpur</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Parallel Training of DNNs with Natural Gradient and Parameter Averaging</article-title>
          .
          <source>arXiv preprint arXiv:1410.7455</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>