=Paper=
{{Paper
|id=Vol-1649/37
|storemode=property
|title=Automatic Symbol Processing for Language Model Building in Slavic Languages
|pdfUrl=https://ceur-ws.org/Vol-1649/37.pdf
|volume=Vol-1649
|authors=Josef Chaloupka
|dblpUrl=https://dblp.org/rec/conf/itat/Chaloupka16
}}
==Automatic Symbol Processing for Language Model Building in Slavic Languages==
<pdf width="1500px">https://ceur-ws.org/Vol-1649/37.pdf</pdf>
<pre>
ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 37–41
http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, © 2016 J. Chaloupka


             Automatic Symbol Processing for Language Model Building in Slavic
                                       Languages

                                                             Josef Chaloupka

                                          The Institute of Information Technology and Electronics,
                               Technical University of Liberec, Studentska 2, 461 17, Liberec, Czech Republic
                                                        josef.chaloupka@tul.cz,
                                      WWW home page: https://www.ite.tul.cz/speechlabe/

Abstract: When we want to adapt an existing automatic speech recognition system to a new
language, we need a large corpus of texts to create a lexicon and a language model, and a
database of annotated recordings to train an acoustic model. Usually the texts in the corpus
(or in the annotations) contain not only words but also other symbols, mainly strings of
digits, special characters and some frequent abbreviations of units. The common feature of
all these symbols is that there is no straightforward correspondence between their printed
form and the spoken one. The main goal of this work was to develop efficient tools for the
automatic translation of symbols or symbolic terms to words for almost all Slavic languages.
In this paper we present our research on the basic elements and the production rules in
Slavic languages, which was used to design our universal text pre- and post-processing tools.

1    Introduction

Systems for automatic continuous speech recognition are currently being developed for many
languages. Most of these systems use an acoustic model (AM) together with a language model
(LM) and a lexicon for classification [1, 2]. Training the AM requires a large database of
audio recordings with text transcriptions. We also need large text corpora for the
calculation of the LM. The problem is that special symbols occur very often in transcribed
text or in text corpora [3]. They account for about 2 to 5% of our text data. In many cases,
the symbols include strings of digits, special characters (%, €, $, ...) or some frequent
abbreviations of physical units (km, kg, °C, ...).

When we build a lexicon, these symbols are usually omitted. In the case of digits, it would
be impossible to have all their combinations in the vocabulary. As to the other symbols, it
would be impractical to keep both abbreviated and full forms there. In inflected languages,
the problem is even more complex. A digit or a string of digits can be translated into
several different words or word combinations, depending on the context. The word
corresponding to, e.g., the digit ’2’ could be either a cardinal or an ordinal number, it can
change its suffix according to the gender and case of the related word (typically a noun),
it can be a part of a decimal number, etc. Similar rules apply to the symbols.

A straightforward solution to this problem does not exist. In non-inflected languages (e.g.
in English), it is often solved by a translation table (for the symbols) or a translation
generator (for the digit strings). This approach is sometimes also used for inflectional
languages together with some simplified (e.g. majority-based) rules. It is also possible to
ignore the symbols and just skip them during the LM calculation. The latter approach has
several risks, though. The most dangerous one is that some words may never appear in the
text form (e.g. ’Celsius’ or some less frequent names of numbers) and therefore they will not
be included in the lexicon, and hence they cannot be recognized by the ASR (Automatic Speech
Recognition) system. The same may happen when they are in the lexicon but not (or just
poorly) represented by the LM. Since the digits and the terms represented by the frequent
symbols play an important role in the information carried by speech, it is necessary to find
an appropriate solution to this problem.

2    Motivation and Context of our Research

Recently, we have been developing a multi-lingual broadcast monitoring system that employs
an ASR technology we had built previously [4]. In a rather short period we need to build
language-specific modules (lexicons, LMs and AMs) for more than 10 languages. As all of them
belong to the Slavic family, our task is somewhat easier because we can benefit from several
facts. All the Slavic languages are more or less related and share many common features,
they use similar patterns in grammar and in morphology and, what is very important, they can
be mutually understood, at least to some extent and after a short reading and listening
training.

For each language, we need to solve the same or very similar tasks, and symbol and digit
processing is one of them. Therefore, we decided to design a set of universal transcription
tools that allow us to avoid routine tasks that would otherwise be repeated for each
language. We have defined the following goals for the transcription tools: They should be
able to generate basic (and with some extensions also several declined) text forms of
cardinal numbers. The same should be available for ordinal numbers. A special tool should
process digit strings that specify dates and years, and another tool will process numbers
with a decimal point. The last type of tools should be focused on the most frequent
abbreviations (physical units, currencies, etc.).

If we want to design and use the tools efficiently, we need to find the patterns and
features that are common either to all Slavic languages or at least to some of them (CZ –
Czech, SK – Slovak, PL – Polish, RU – Russian, BY – Belarusian, UA – Ukrainian, HR –
Croatian, RS – Serbian, SL – Slovenian, BG – Bulgarian and MK – Macedonian). These Slavic
languages have been selected because we have designed and implemented a complex system for
automatic transcription of broadcast programs for these languages [5, 6, 7] or we are
modifying the system for them (UA, BY). The resulting recognition rate of our transcription
system is over 80% for all the mentioned languages. It is a relatively good result but we
are continuously improving the transcription system for each Slavic language.

After that we need to define the basic elements (primitives) and the production rules, which
will allow us to translate almost any digit string, any date and year, or any decimal number
to the correct text form in each of the given languages. The tools are essential for several
practical tasks, namely:

  • To ensure that the words usually represented by their symbols appear in the lexicon.

  • To translate symbols and symbolic terms to words (text pre-processing) and when needed
    also back to symbols (post-processing).

  • To enhance the LM by adding translated forms into the corpus. The enhancement can also be
    done by generating randomly chosen digit and symbol strings using the rules and patterns
    applicable to each language.

  • To enhance acoustic model training by better and more correct annotation of speech data,
    employing the transcription tools and allowing them to use alternative, minor or even
    colloquial rules of transcription and pronunciation.

3    Symbol to Text Translation

In this work, symbol to text translation is solved for cardinal, ordinal and decimal
numbers, for dates, and for a cardinal/decimal number in combination with an abbreviation.

3.1    Cardinal Numbers

The first task was to find how to convert a string of digits to a word representation for
different Slavic languages (SLang). This task is relatively easy in English but it is more
complicated for SLang. The words for the numbers one and two (plus three and four in SK) are
inflected by gender, i.e. they are gender dependent (GD); the other numbers from 3 (5 in SK)
to 9 are gender independent (GI). It would be possible to generate the numbers from eleven
to nineteen (number 1-9 + teen), the tens (10, 20, ...), which are formed by adding ’ten’ to
the end of the digit root, or the hundreds (100, 200, ...), which are formed in the same way
as the tens by adding the suffix hundred. But we treat them as specific words because there
are several exceptions in different SLang. SK and SL have the fewest exceptions, and their
hundreds can be generated without exception. Several different patterns (systems of rules)
exist in SLang for the numbers from twenty-one to ninety-nine:

  • DU: the ten ’D - Decade’ comes first, then the Unit ’U’

  • D_U: the same as the first one but with a space ’_’ between D and U

  • D_&_U: spaces and the word ’and - &’ (e.g. CZ - a, PL - i, SL - in) are between D and U

  • U&D: U comes first, then D, joined together by the word ’and - &’.

Only the patterns that are used most often in each SLang are shown in Table 1. We know that
different SLang also have alternative (or minor) patterns for the conversion of digits
(21 - 99) to number words (e.g. CZ - U&D) but we have not considered them in this work.

Table 1: Cardinal number patterns

  Numbers     Pattern   Language
  1, 2        GD        All
  3, 4        GD        SK
  21-99       D_U       CZ, RU, UA, BY, PL
              DU        SK
              D_&_U     HR, RS, BG, MK
              U&D       SL
  hundreds    -         -
  thousands   3F        CZ, PL, RU, UA, BY, HR, RS
              2F        BG, MK
              1F        SK, SL
  millions    3F        CZ, SK, PL, RU, UA, BY, SL
              2F        HR, RS, BG, MK
  milliards   3F        CZ, SK, PL, RU, UA, BY, HR, RS, SL
              2F        BG, MK

The conversion of numbers larger than a thousand is once again specific. Being gendered, all
the Higher Scale Names (HSN - thousands, millions, milliards, ...) follow the declension
rules of the respective SLang. There are three main patterns for the conversion of HSN:

  • 3F: three different word forms (1 HSN, 2-4 HSN and more than 4 HSN, e.g. CZ - jeden
    milion (one million), dva miliony (two millions), pět milionů (five millions)).

  • 2F: two different word forms (1 HSN, more than 1 HSN)

  • 1F: one word form (without declension)

Several large-number naming systems exist for numbers greater than a million. The long scale
(LS) and the short scale (SS) systems are used in Europe. In the LS, every new term greater
than a million is one million times bigger than the previous term (e.g. billion means a
million millions), whereas in the SS every new term greater than a million is one thousand
times bigger than the previous term (e.g. billion means a thousand millions). The LS is used
in CZ, SK, PL, HR, RS and the SS in RU, UA, BY, BG and MK, but with one exception: milliard
in the LS system is billion in the SS system, and the Slavic countries with the SS system
use the word milliard (from the LS) instead of billion. Other names for higher numbers are
taken from the SS system (trillion, quadrillion, ...). Strings of digits with values higher
than a milliard are very rare in our text corpora, therefore they are not handled in this
work.

The patterns (Table 1) and a list of words representing the names of numbers are necessary
for the conversion from strings of digits to words or for the generation of numbers for
training the LM of any SLang. The list of number words is larger than in English but it is
still relatively small, e.g. we need only 49 words for the generation of any cardinal number
from zero to several milliards in CZ.

3.2    Ordinal Numbers

Ordinal numbers are represented in text (in almost all SLang) by strings of digits where a
dot ’.’ is the last character. The exception is an ordinal number (without a dot) in a date,
see Section 3.4. Two patterns exist for the translation of strings of digits to words if the
string is a higher ordinal number, e.g. 21.:

  • AO: All number words are Ordinal (e.g. PL - dwudziesty pierwszy (twentieth first))

  • LO: only the Last member is Ordinal, the other words are cardinal numbers (e.g. HR -
    dvadeset i prvi (twenty and first))

The other rules for combining words in ordinal numbers are the same as for cardinal numbers.
In English, ordinal numbers are very often written as a combination of digits and an
abbreviation, e.g. 1st - first. A similar writing pattern appears in text in RU, BY and MK,
e.g. RU - 1-й (первый - first). We solved it in RU (BY, MK) by a simple lookup table. In the
other SLang, the combination of digits and an abbreviation does not exist or is very rare.

Table 2: Ordinal number patterns

  Pattern   Language
  AO        CZ, SK, PL
  LO        RU, UA, BY, HR, RS, SL, BG, MK

3.3    Decimal Numbers

Two main decimal marks (separators) are used to separate the Integer part (I, denoted W in
the patterns below) from the Fractional part (F) of a decimal number. The decimal comma is
used as the decimal mark in all SLang. Only in HR can both the decimal comma and the decimal
point be found in text corpora. The decimal comma is read as whole (w) (e.g. SL - cela),
comma (c) (e.g. HR - zarez) or as and (&) (e.g. PL - i). The name of the last digit’s place
value (DN) can be used in the decimal number conversion (e.g. tenths, hundredths,
thousandths, ten-thousandths, hundred-thousandths, millionths).

The patterns for the translation of decimal numbers (digits) to words are as follows:

  • W_w_F(DN): e.g. CZ - dvě celé šest setin - two whole six hundredths

  • W_&_F(DN): e.g. PL - dwa i sześć setnych - two and six hundredths

  • W_w_&_F(DN): e.g. BG - два цяло и шест стотни - two whole and six hundredths

  • W_c_F: e.g. MK - два запирка нула еден - two comma zero one

Alternative patterns for the conversion of decimal numbers (digits) exist in several SLang,
e.g. the pattern W_w_&_F in UA and PL, but only the main patterns are used in our
transcription system at present.

The word whole (e.g. SL – cela) and DN are inflected in SLang. There are three word forms of
the word whole in CZ and SK, two word forms in RU and SL and only one in BG. In SK and CZ,
the number placed before the comma is followed by the first word form for numbers ending in
one, by the second word form for numbers from two to four and by the third word form for all
other numbers. In RU and SL, the first word form of the word whole is used for numbers
ending in one and the second word form for other numbers. The inflection of the DN words is
more complicated. There are several different exceptions here, so we keep a special list of
DN words for each group of numbers in each SLang.

Table 3: Decimal number patterns

  Pattern       Language
  W_w_F(DN)     CZ, SK, RU, SL
  W_&_F(DN)     PL, UA, BY
  W_w_&_F(DN)   BG, (PL), (UA)
  W_c_F         HR, RS, MK
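
The decimal patterns of Table 3 can be read as templates that fix the order of already
translated parts. The following Python sketch illustrates this reading only; it is not the
authors' tool. The names PATTERNS and compose_decimal are hypothetical, and the words for
each slot (W, w, &, c, F, DN) are assumed to have been chosen and inflected elsewhere.

```python
# Hypothetical sketch: each pattern name from Table 3 maps to an ordered
# tuple of slots. W = integer part, F = fractional part, w = "whole",
# c = "comma", & = "and", DN = name of the last digit's place value.
PATTERNS = {
    "W_w_F(DN)": ("W", "w", "F", "DN"),
    "W_&_F(DN)": ("W", "&", "F", "DN"),
    "W_w_&_F(DN)": ("W", "w", "&", "F", "DN"),
    "W_c_F": ("W", "c", "F"),
}


def compose_decimal(pattern, parts):
    """Join the words in `parts` in the slot order given by `pattern`.

    `parts` maps slot names to words that are already inflected; this
    function only arranges them and performs no inflection itself.
    """
    slots = PATTERNS[pattern]
    missing = [slot for slot in slots if slot not in parts]
    if missing:
        raise KeyError(f"pattern {pattern} needs slots {missing}")
    return " ".join(parts[slot] for slot in slots)


# CZ example from the text (2,06):
print(compose_decimal(
    "W_w_F(DN)", {"W": "dvě", "w": "celé", "F": "šest", "DN": "setin"}))
# prints: dvě celé šest setin
```

Keeping the pattern as data rather than code mirrors the idea that only the pattern
parameter and the word lists change between languages.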

The last exception in the SLang that use the word ’whole’ in decimal numbers is that the
most common way of reading a decimal number beginning with zero is to read only the
fractional part (F) together with DN, e.g. CZ - 0,21 - dvacet jedna setin - twenty one
hundredths.

3.4    Dates and Years

Dates occur frequently in text corpora in the form of strings of digits, e.g. 8. 5. 1945, or
as a combination of strings of digits with the name of the month, e.g. PL - 8 maja 1945. The
main date format is day-month-year in all SLang. We decide that a string is a date if two
strings of digits followed by dots (ordinal numbers) are next to each other, e.g. 8. 5., or
if a string of digits precedes the name of the month. In some SLang the string of digits
before the name of the month is followed by a dot (CZ, SK, HR, RS, SL), e.g. CZ - 8. května,
in others there is only a string of digits without a dot (PL, RU, UA, BY, BG, MK), e.g. PL -
8 maja. Note that a string of digits without a dot before the name of the month is still an
ordinal number. Latin-derived names of months are used in SK, RU, RS, SL, BG, MK; a set of
older month names that differ from the Latin ones is used in CZ, PL, UA, BY, HR.

In our tools, we process the day together with the month, and the year separately. There are
two possible readings of date strings in SLang, e.g. ’1. 1.’ - ’first first’ or ’first
January’. In the first approach (’first first’), the words for ordinal numbers are inflected
by case (N - nominative, G - genitive, ...). There is no inflection by case in BG and MK,
therefore the words stay in their basic form (B_B). There are three possible case patterns
in addition to B_B:

  • G_N: e.g. CZ - prvního (first - genitive) první (first - nominative)

  • N_N: e.g. PL - pierwszy (first - nominative) pierwszy (first - nominative)

  • G_G: e.g. HR - prvog (first - genitive) prvog (first - genitive)

  • B_B: e.g. BG - първи (first) първи (first)

Pattern G_N occurs in CZ, SK, N_N in PL, G_G in HR, RS and B_B in BG and MK. In the other
SLang the first approach is very rare or unusual and the second approach is more common.

In the second approach (’first January’), both words, the ordinal number representing the
day and the name of the month, are in the genitive in SLang (except BG and MK, which have no
cases). The ordinal number has to be in the nominative if the name of the month is in the
nominative, but this variant is less common in all SLang.

A string of digits is detected as a year in the text if: 1) the name of the month precedes
it, 2) two short (1 - 31(12)) ordinal numbers precede it, 3) some form of the word year (or
its abbreviation, e.g. BG - г.) precedes or follows the string. The year is usually a
cardinal (CZ, SK, SL) or an ordinal number (PL, RU, UA, BY, HR, RS, BG, MK). There are
several exceptions for the transcription of the date and the year in different SLang,
therefore our tools use only the main patterns (forms).

CZ has one specific feature: years above one thousand and below two thousand are read as
multiples of the word one hundred, e.g. 1900 - devatenáct set - nineteen hundred.

3.5    Combination of digits and abbreviation

The last task was to translate a string of digits followed by an abbreviation to words in
the text. In our case, the abbreviations were the special characters ’€’, ’$’ or ’%’ and the
abbreviations of physical units ’km’, ’l’, ’kg’, ’°C’ or ’m/s’. This task is relatively
easy. The number (a string of digits) before the abbreviation is a cardinal number and there
are three (3F) or two (2F) word forms of the abbreviation. In 3F, the first word form is
used in combination with the number one, the second for numbers from two to four and the
third for numbers higher than four, e.g. SL - en kilometer (one kilometer), dva kilometra
(two kilometers), pet kilometrov (five kilometers). In 2F, the first word form is used in
combination with the number one (singular) and the second word form for numbers higher than
one (plural). Pattern 2F is the same as in English. There are several exceptions for the
inflection of some abbreviations in pattern 3F or 2F in different SLang. For example, the
word euro (’€’) is not inflected in PL, RU, UA, BY, BG, MK and pattern 2F (not 3F) is used
for it in HR and RS.

Table 4: Inflection of an abbreviation in combination with a digit string

  Pattern   Language
  3F        CZ, SK, PL, SL, RU, UA, BY, HR, RS
  2F        BG, MK

4    Discussion and Practical Applications

We have developed a universal software tool for translating symbols (mainly digits) and
symbolic terms to words (text pre-processing) and back (post-processing). The pre- and
post-processing provided by this tool is applied to our databases (text corpora or annotated
audio recordings) to train the AM or to calculate the LM. The tool can also be used as a
random or interval generator of word strings (cardinal, ordinal, decimal numbers or dates or
cardinal/decimal numbers with an abbreviation). The generator is useful for re-training the
LM. The input to the tool is an XML file for the given SLang and several parameters that
correspond to the patterns described above. All important information for the transcription
is saved in the XML file, e.g. 1 – one, 2 – two, 1. – first, 2. – second ...
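
The 3F/2F/1F patterns used for higher scale names (Table 1) and for abbreviations (Table 4)
are one example of how a pattern can act as a parameter. The Python sketch below shows the
selection rule exactly as it is stated in the text; the function name select_form is
hypothetical and not part of the described tool. It assumes a positive whole-number count and
ignores compound numbers (e.g. 22) and the language-specific exceptions mentioned above.

```python
def select_form(pattern, count, forms):
    """Pick the word form of a scale name or unit for a given count.

    pattern: "3F", "2F" or "1F" (see Tables 1 and 4)
    count:   positive whole number that precedes the word (assumption)
    forms:   word forms ordered as described in the text
    """
    if pattern == "1F":
        # one form, no declension
        return forms[0]
    if pattern == "2F":
        # singular for one, plural otherwise
        return forms[0] if count == 1 else forms[1]
    if pattern == "3F":
        # one / two to four / more than four
        if count == 1:
            return forms[0]
        if count in (2, 3, 4):
            return forms[1]
        return forms[2]
    raise ValueError(f"unknown pattern: {pattern}")


# CZ forms of 'million' taken from the 3F example in Section 3.1
cz_million = ["milion", "miliony", "milionů"]
print(select_form("3F", 2, cz_million))
# prints: miliony
```

A per-language list of exceptions (such as the uninflected word euro) would have to be
checked before this rule is applied.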

Example for CZ: CZ.XML -F T D_U GD 2 AO W_w_F -DN Yes -ZERO Yes -Year 11 CN

where:

  • [-F T] Function: T translation, G generator.

  • [GD 2] digits 1 and 2 are not transcribed; we cannot solve the transcription of GD
    cardinal numbers at present.

  • [-Year 11 CN] the year is a cardinal number (CN) and 11 indicates that years above 1000
    and below 2000 are read as multiples of the word one hundred. Parameter 10 is set for all
    other SLang.

  • [-DN Yes] parameter for decimal numbers: the name of the last digit’s place value is
    used, e.g. 0,25 - nula celá dvacet pět setin; [-DN No] e.g. nula celá dvacet pět.

  • [-ZERO Yes] parameter for decimal numbers: the first word is zero for 0,..., e.g. nula
    celá dvacet pět setin; [-ZERO No] e.g. dvacet pět setin.

Word strings following the minor patterns can easily be generated by changing the parameter
settings for the given SLang. The translation tool is available on-line:
http://kvap.tul.cz/slavic_symbols.php and it is still being improved with the help of native
speakers.

5    Conclusion and Future Work

We have defined several patterns for the translation of any digit string in texts of almost
all Slavic languages. A digit string is a cardinal, ordinal or decimal number, a date, or a
number in combination with an abbreviation. The rules are relatively complex but we have
focused primarily on the main patterns because we need them for building systems for the
automatic transcription of broadcast programs. The speakers there mainly use formal language
and the official patterns. Our text corpora consist mostly of news, which is written in
formal language too. The patterns described in this paper are used to develop tools for the
translation of symbols to words in the pre-processing and also in the post-processing of
text. The main application area for these tools is the enhancement of language models and
the improvement of speech data annotation for training the acoustic model. The tools have
been designed and implemented in the same way for all Slavic languages. Only the patterns,
passed as parameters, and the lexicon change for each Slavic language.

In the near future, we would like to determine the probability of the alternative or minor
patterns in our audio recordings. These alternative patterns will be used for the random
generation of words from symbols in the process of language model re-training. The main
patterns will still be used for symbol to word translation in text pre- or post-processing
because otherwise the errors introduced could possibly outweigh the improvements brought by
the translation tools.

6    Acknowledgments

The research described in this paper was supported by the Technology Agency of the Czech
Republic (project no. TA04010199).

References

[1] Chong, T. Y., Banchs, R. E., Chng, E. S., Li, H.: TDTO Language Modeling with
    Feedforward Neural Networks. In Proc. of Interspeech 2015, Dresden, Germany,
    p. 1458-1462, 2015.
[2] Loof, J., Gollan, C., Ney, H.: Cross-language bootstrapping for unsupervised acoustic
    model training: Rapid development of a Polish speech recognition system. In Proc. of
    Interspeech 2009, UK, p. 88-91, 2009.
[3] Vasserman, L., Schogol, V., Hall, K.: Sequence-based Class Tagging for Robust
    Transcription in ASR. In Interspeech 2015, p. 473-477, 2015.
[4] Nouza, J., Cerva, P., Kucharova, M.: Cost-Efficient Development of Acoustic Models for
    Speech Recognition of Related Languages. In Radioengineering, vol. 22, no. 3,
    p. 866-873, ISSN 1210-2512, 2013.
[5] Cerva, P., Nouza, J., Silovsky, J.: Study on Cross-Lingual Adaptation of a Czech LVCSR
    System towards Slovak. Springer Verlag, Vol. 6800, p. 81-87, 2011.
[6] Nouza, J., Cerva, P., Zdansky, J., Kucharova, M.: A study on adapting Czech automatic
    speech recognition system to Croatian language. In Proc. of Elmar 2012, Zadar (Croatia),
    p. 227-230, 2012.
[7] Nouza, J., Cerva, P., Safarik, R.: Cross-Lingual Adaptation of Broadcast Transcription
    System to Polish Language Using Public Data Sources. In LTC 2015, Poland, p. 181-185,
    ISBN 978-83-932640-8-7, 2015.

</pre>