=Paper=
{{Paper
|id=Vol-1649/37
|storemode=property
|title=Automatic Symbol Processing for Language Model Building in Slavic Languages
|pdfUrl=https://ceur-ws.org/Vol-1649/37.pdf
|volume=Vol-1649
|authors=Josef Chaloupka
|dblpUrl=https://dblp.org/rec/conf/itat/Chaloupka16
}}
==Automatic Symbol Processing for Language Model Building in Slavic Languages==
ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 37–41, http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, © 2016 J. Chaloupka

Josef Chaloupka

The Institute of Information Technology and Electronics, Technical University of Liberec, Studentska 2, 461 17, Liberec, Czech Republic

josef.chaloupka@tul.cz, WWW home page: https://www.ite.tul.cz/speechlabe/

Abstract: When we want to adapt an existing automatic speech recognition system to a new language, we need a large corpus of texts to create a lexicon and a language model, and a database of annotated recordings to train an acoustic model. Usually the texts in the corpus (or in the annotations) contain not only words but also other symbols, mainly strings of digits, special characters and some frequent abbreviations of units. The common feature of all these symbols is that there is no straightforward correspondence between their printed form and the spoken one. The main goal of this work was to develop efficient tools for automatic translation of symbols or symbolic terms to words for almost all Slavic languages. In this paper we present our study of the basic elements and the production rules in Slavic languages, which was used for the design of our universal text pre- and post-processing tools.

===1 Introduction===

Systems for automatic continuous speech recognition are currently being developed for many languages. For classification, most of these systems use an acoustic model (AM) together with a language model (LM) and a lexicon [1, 2]. A large audio database with recordings and text transcriptions is needed to train the AM, and large text corpora are needed to calculate the LM. The problem is that special symbols occur very often in transcribed text and in text corpora [3]. They make up about 2 to 5% of our text data. In many cases, the symbols include strings of digits, special characters (%, €, $, ...) or some frequent abbreviations of physical units (km, kg, °C, ...).

When we build a lexicon, these symbols are usually omitted. In the case of digits, it would be impossible to have all their combinations in the vocabulary. As to the other symbols, it would be impractical to keep both abbreviated and full forms there. In inflected languages, the problem is even more complex. A digit or a string of digits can be translated into several different words or word combinations, depending on the context. The word corresponding, e.g., to the digit ‘2’ could be either a cardinal or an ordinal number, it can change its suffix according to the gender and case of the related word (typically a noun), it can be a part of a decimal number, etc. Similar rules apply to the symbols.

A straightforward solution to this problem does not exist. In non-inflected languages (e.g. in English), it is often solved by a translation table (for the symbols) or a translation generator (for the digit strings). This approach is sometimes also used for inflectional languages, together with some simplified (e.g. majority based) rules. It is also possible to ignore the symbols and just skip them during the LM calculation. The latter approach has several risks, though. The most dangerous one is that some words may never appear in the text form (e.g. ‘Celsius’ or some less frequent names of numbers); they will therefore not be included in the lexicon, and hence they cannot be recognized by the ASR (Automatic Speech Recognition) system. The same may happen when they are in the lexicon but not (or just poorly) represented by the LM. Since the digits and the terms represented by the frequent symbols play an important role in the information carried by speech, it is necessary to find an appropriate solution to this problem.

===2 Motivation and Context of our Research===

Recently, we have been developing a multi-lingual broadcast monitoring system that employs an ASR technology we had built previously [4]. In a rather short period we need to build language specific modules (lexicons, LMs and AMs) for more than 10 languages. As all of them belong to the Slavic family, our task is somewhat easier because we can benefit from several facts. All the Slavic languages are more or less related and share many common features, they use similar patterns in grammar and in morphology and, what is very important, they can be mutually understood, at least to some extent and after a short reading and listening training.

For each language, we need to solve the same or very similar tasks, and symbol and digit processing is one of them. Therefore, we decided to design a set of universal transcription tools that allow us to avoid routine tasks that would otherwise be repeated for each language. We have defined the following goals for the transcription tools. They should be able to generate basic (and, with some extensions, also several declined) text forms of cardinal numbers. The same should be available for ordinal numbers. A special tool should process digit strings that specify dates and years, and another tool should process numbers with a decimal point. The last type of tools should be focused on the most frequent abbreviations (physical units, currencies, etc.).

If we want to design and use the tools efficiently, we need to find the patterns and features that are common either to all Slavic languages or at least to some of them (CZ – Czech, SK – Slovak, PL – Polish, RU – Russian, BY – Belarusian, UA – Ukrainian, HR – Croatian, RS – Serbian, SL – Slovenian, BG – Bulgarian and MK – Macedonian). These Slavic languages have been selected because we have designed and implemented a complex system for automatic transcription of broadcast programs for them [5, 6, 7], or we are modifying the system for them (UA, BY). The recognition rate of our transcription system is over 80% for all the mentioned languages. This is a relatively good result, but we are continuously improving the transcription system for each Slavic language.

After that we need to define the basic elements (primitives) and the production rules that allow us to translate almost any digit string, any date and year, or any decimal number to the correct text form in each of the given languages. The tools are essential for several practical tasks, namely:

* To ensure that the words usually represented by their symbols appear in the lexicon.
* To translate symbols and symbolic terms to words (text pre-processing) and, when needed, also back to symbols (post-processing).
* To enhance the LM by adding translated forms into the corpus. The enhancement can also be done by generating randomly chosen digit and symbol strings using the rules and patterns applicable to each language.
* To enhance acoustic model training by better and more correct annotation of speech data, employing the transcription tools and allowing them to use alternative, minor or even colloquial rules of transcription and pronunciation.

===3 Symbol to Text Translation===

In this work, symbol to text translation is solved for cardinal, ordinal and decimal numbers, for dates, and for a cardinal or decimal number in combination with an abbreviation.

====3.1 Cardinal Numbers====

The first task was to find out how to convert a string of digits to a word representation in different Slavic languages (SLang). This task is relatively easy in English but more complicated in SLang. The words for the numbers one and two (plus three and four in SK) are inflected by gender, i.e. they are gender dependent (GD); the other numbers from 3 (5 in SK) to 9 are gender independent (GI). It would be possible to generate the numbers from eleven to nineteen (number 1-9 + teen), the tens (10, 20, ...), which are formed by adding ‘ten’ to the end of the digit root, or the hundreds (100, 200, ...), which are formed in the same way as the tens by adding the suffix hundred. However, we treat them as specific words because there are several exceptions in different SLang. SK and SL have the fewest exceptions, and their hundreds can be generated without exception. Several different patterns (systems of rules) exist in SLang for the numbers from twenty-one to ninety-nine:

* DU: the ten (‘D’ - Decade) comes first, then the Unit (‘U’)
* D_U: the same as the first one, but with a space (‘_’) between D and U
* D_&_U: spaces and the word ‘and’ (‘&’; e.g. CZ - a, PL - i, SL - in) are between D and U
* U&D: U comes first, then D, joined together by the word ‘and’ (‘&’)

Only the patterns that are used most often in the individual SLang are shown in Table 1. We know that different SLang also have alternative (or minor) patterns for the conversion of digits (21 - 99) to number words (e.g. CZ - U&D), but we have not considered them in this work.

{| class="wikitable"
|+ Table 1: Cardinal number patterns
! Numbers !! Pattern !! Language
|-
| 1, 2 || GD || All
|-
| 3, 4 || GD || SK
|-
| 21-99 || D_U || CZ, RU, UA, BY, PL
|-
| 21-99 || DU || SK
|-
| 21-99 || D_&_U || HR, RS, BG, MK
|-
| 21-99 || U&D || SL
|-
| hundreds || - || -
|-
| thousands || 3F || CZ, PL, RU, UA, BY, HR, RS
|-
| thousands || 2F || BG, MK
|-
| thousands || 1F || SK, SL
|-
| millions || 3F || CZ, SK, PL, RU, UA, BY, SL
|-
| millions || 2F || HR, RS, BG, MK
|-
| milliards || 3F || CZ, SK, PL, RU, UA, BY, HR, RS, SL
|-
| milliards || 2F || BG, MK
|}

The conversion of numbers larger than a thousand is once again specific. Being gendered, all the Higher Scale Names (HSN - thousands, millions, milliards, ...) follow the declension rules of the individual SLang. There are three main patterns for the conversion of HSN:

* 3F: three different word forms (1 HSN, 2-4 HSN and more than 4 HSN), e.g. CZ - jeden milion (one million), dva miliony (two millions), pět milionů (five millions)
* 2F: two different word forms (1 HSN, more than 1 HSN)
* 1F: one word form (without declension)

Several large-number naming systems exist for numbers greater than a million. Both the long scale (LS) and the short scale (SS) are used in Europe. In the LS, every new term greater than a million is one million times bigger than the previous term (e.g. billion means a million millions), whereas in the SS every new term greater than a million is one thousand times bigger than the previous term (e.g. billion means a thousand millions). The LS is used in CZ, SK, PL, HR and RS, and the SS in RU, UA, BY, BG and MK, but with one exception: a milliard in the LS is a billion in the SS, and the Slavic countries with the SS use the word milliard (from the LS) instead of billion. Their other names for higher numbers come from the SS (trillion, quadrillion, ...). Digit strings with values higher than a milliard are very rare in our text corpora, therefore they are not solved in this work.

The patterns (Table 1) and a list of words representing the names of numbers are necessary for the conversion from strings of digits to words, or for the generation of numbers for training the LM of any SLang. The list of number words is larger than in English but still relatively small, e.g. we need only 49 words to generate any cardinal number from zero to several milliards in CZ.

====3.2 Ordinal Numbers====

In almost all SLang, ordinal numbers are written in text as strings of digits whose last character is a dot ‘.’. The exception is an ordinal number (without a dot) in a date, see Section 3.4. Two patterns exist for the translation of digit strings to words if the string is a higher ordinal number, e.g. 21.:

* AO: All number words are Ordinal (e.g. PL - dwudziesty pierwszy (twentieth first))
* LO: only the Last member is Ordinal, the other words are cardinal numbers (e.g. HR - dvadeset i prvi (twenty and first))

{| class="wikitable"
|+ Table 2: Ordinal number patterns
! Pattern !! Language
|-
| AO || CZ, SK, PL
|-
| LO || RU, UA, BY, HR, RS, SL, BG, MK
|}

The other rules for combining words in ordinal numbers are the same as for cardinal numbers. In English, ordinal numbers are very often written as a combination of digits and an abbreviation, e.g. 1st - first. A similar writing pattern appears in text in RU, BY and MK, e.g. RU - 1-й (первый - first). We solved it in RU (BY, MK) by a simple lookup table. In the other SLang, the combination of digits and an abbreviation does not exist or is very rare.

====3.3 Decimal Numbers====

Two main decimal marks (separators) are used to separate the integer part (I, denoted W in the patterns below) from the fractional part (F) of a decimal number. The decimal comma is used as the decimal mark in all SLang; only in HR can both the decimal comma and the decimal point be found in text corpora. The decimal comma is read as whole (w) (e.g. SL - cela), as comma (c) (e.g. HR - zarez) or as and (&) (e.g. PL - i). The name of the last digit's place value (DN) can be used in the conversion of a decimal number (e.g. tenths, hundredths, thousandths, ten-thousandths, hundred-thousandths, millionths). The patterns for the translation of decimal numbers (digits) to words are as follows:

* W_w_F(DN): e.g. CZ - dvě celé šest setin - two whole six hundredths
* W_&_F(DN): e.g. PL - dwa i sześć setnych - two and six hundredths
* W_w_&_F(DN): e.g. BG - два цяло и шест стотни - two whole and six hundredths
* W_c_F: e.g. MK - два запирка нула еден - two comma zero one

Alternative patterns for the conversion of decimal numbers exist in several SLang, e.g. the pattern W_w_&_F in UA and PL, but only the main patterns are used in our transcription system at present.

{| class="wikitable"
|+ Table 3: Decimal number patterns
! Pattern !! Language
|-
| W_w_F(DN) || CZ, SK, RU, SL
|-
| W_&_F(DN) || PL, UA, BY
|-
| W_w_&_F(DN) || BG, (PL), (UA)
|-
| W_c_F || HR, RS, MK
|}

The word whole (e.g. SL – cela) and the DN are inflected in SLang. There are three forms of the word whole in CZ and SK, two forms in RU and SL and only one in BG. In SK and CZ, the number placed before the comma is followed by the first form for numbers ending in one, by the second form for numbers from two to four and by the third form for all other numbers. In RU and SL, the first form of the word whole is used for numbers ending in one and the second form for all other numbers. The inflection of the DN words is more complicated. There are several different exceptions here, so we keep a special list of DN words for each group of numbers in each SLang.

The last exception in the SLang that use the word ‘whole’ in decimal numbers is that the most common way of reading a decimal number beginning with zero is to read only the fractional part (F) together with the DN, e.g. CZ - 0,21 - dvacet jedna setin - twenty one hundredths.

====3.4 Dates and Years====

Dates occur frequently in text corpora in the form of strings of digits, e.g. 8. 5. 1945, or as a combination of strings of digits with the name of the month, e.g. PL - 8 maja 1945. The main date format is day-month-year in all SLang. We decide that strings form a date if two strings of digits followed by dots (ordinal numbers) are next to each other, e.g. 8. 5., or if a string of digits precedes the name of the month. In some SLang the digit string before the name of the month is followed by a dot (CZ, SK, HR, RS, SL), e.g. CZ - 8. května, while in others it is written without a dot (PL, RU, UA, BY, BG, MK), e.g. PL - 8 maja. Note that a string of digits without a dot before the name of the month is still an ordinal number. Latin-derived names of months are used in SK, RU, RS, SL, BG and MK; a set of older month names that differ from the Latin ones is used in CZ, PL, UA, BY and HR.

In our tools, the day is processed together with the month, and the year is processed separately. There are two possible readings of date strings in SLang, e.g. ‘1. 1.’ - ‘first first’ or ‘first January’. In the first approach (‘first first’), the words for the ordinal numbers are inflected by case (N - nominative, G - genitive, ...). There is no inflection by case in BG and MK, therefore the words stay in their basic form (B_B). Four patterns are possible:

* G_N: e.g. CZ - prvního (first - genitive) první (first - nominative)
* N_N: e.g. PL - pierwszy (first - nominative) pierwszy (first - nominative)
* G_G: e.g. HR - prvog (first - genitive) prvog (first - genitive)
* B_B: e.g. BG - първи (first) първи (first)

Pattern G_N occurs in CZ and SK, N_N in PL, G_G in HR and RS, and B_B in BG and MK. In the other SLang, the first approach is very rare or unusual and the second approach is more common.

In the second approach (‘first January’), both words, the ordinal number denoting the day and the name of the month, are in the genitive (except in BG and MK, which do not have cases). The ordinal number has to be in the nominative if the name of the month is in the nominative, but this variant is less common in all SLang.

A string of digits is detected as a year if: 1) the name of the month precedes it, 2) two short ordinal numbers (1 - 31 for the day, 1 - 12 for the month) precede it, or 3) some form of the word year (or its abbreviation, e.g. BG - г.) precedes or follows the string. The year is usually read as a cardinal number (CZ, SK, SL) or as an ordinal number (PL, RU, UA, BY, HR, RS, BG, MK). There are several exceptions in the transcription of the date and the year in different SLang, therefore our tools use only the main patterns (forms).

CZ has one specific feature: years above one thousand and below two thousand are read as multiples of the word one hundred, e.g. 1900 - devatenáct set - nineteen hundred.

====3.5 Combination of digits and abbreviation====

The last task was to translate a string of digits followed by an abbreviation to words. In our case, the abbreviations were the special characters ‘€’, ‘$’ or ‘%’ and the abbreviations of physical units ‘km’, ‘l’, ‘kg’, ‘°C’ or ‘m/s’. This task is relatively easy. The number (a string of digits) before the abbreviation is a cardinal number, and the abbreviation has three (3F) or two (2F) word forms. In 3F, the first form is used in combination with the number one, the second with numbers from two to four and the third with numbers higher than four, e.g. SL - en kilometer (one kilometer), dva kilometra (two kilometers), pet kilometrov (five kilometers). In 2F, the first form is used in combination with the number one (singular) and the second with numbers higher than one (plural). Pattern 2F is the same as in English. There are several exceptions in the inflection of some abbreviations in pattern 3F or 2F in different SLang. For example, the word euro (‘€’) is not inflected in PL, RU, UA, BY, BG and MK, and pattern 2F (not 3F) is used for it in HR and RS.

{| class="wikitable"
|+ Table 4: Inflection of an abbreviation in combination with a digit string
! Pattern !! Language
|-
| 3F || CZ, SK, PL, SL, RU, UA, BY, HR, RS
|-
| 2F || BG, MK
|}

===4 Discussion and Practical Applications===

We have developed a universal software tool for translating symbols (mainly digits) and symbolic terms to words (text pre-processing) and back (post-processing). The pre- and post-processing provided by this tool is applied to our databases (text corpora or annotated audio recordings) that are used to train the AM or to calculate the LM. The tool can also be used as a random or interval generator of word strings (cardinal, ordinal and decimal numbers, dates, or cardinal/decimal numbers with an abbreviation). The generator is useful for re-training the LM. The input to the tool is an XML file for the given SLang and several parameters that represent the patterns described above. All the information important for transcription is stored in the XML file, e.g. 1 – one, 2 – two, 1. – first, 2. – second, ...

Example for CZ:

 CZ.XML -F T D_U GD 2 AO W_w_F -DN Yes -ZERO Yes -Year 11 CN

where:

* [-F T] Function: T translation, G generator
* [GD 2] the digits 1 and 2 are not transcribed; we cannot solve the transcription of GD cardinal numbers at present
* [-Year 11 CN] the year is a cardinal number (CN), and 11 indicates that years above 1000 and below 2000 are read as multiples of the word one hundred; parameter 10 is set for all other SLang
* [-DN Yes] parameter for decimal numbers: the name of the last digit's place value is used, e.g. 0,25 - nula celá dvacet pět setin; [-DN No] e.g. nula celá dvacet pět
* [-ZERO Yes] parameter for decimal numbers: the first word is zero for 0,..., e.g. nula celá dvacet pět setin; [-ZERO No] e.g. dvacet pět setin

Word strings following the minor patterns of different SLang can easily be generated by changing the parameter settings. The translation tool is available on-line at http://kvap.tul.cz/slavic_symbols.php and is still being improved with the help of native speakers.

===5 Conclusion and Future Work===

We have defined several patterns for the translation of any digit string in texts of almost all Slavic languages. A digit string is a cardinal, ordinal or decimal number, a date, or a number in combination with an abbreviation. The rules are relatively complex, but we have focused primarily on the main patterns because we need them for building systems for the automatic transcription of broadcast programs. The speakers in such programs mainly use formal language and official patterns. Our text corpora consist mostly of news, which is also written in formal language.

The patterns described in this paper are used to develop tools for the translation of symbols to words in text pre-processing and also in post-processing. The main application area for these tools is the enhancement of language models and the improvement of speech data annotation for training the acoustic model. The tools have been designed and implemented in the same way for all Slavic languages. Only the patterns (given as parameters) and the lexicon change from one Slavic language to another.

In the near future, we would like to estimate the probability of alternative or minor patterns in our audio recordings. These alternative patterns will be used for the random generation of words from symbols in the process of language model re-training. The main patterns will still be used for symbol to word translation in text pre- or post-processing, because otherwise the errors introduced could outweigh the improvements brought by the translation tools.

===6 Acknowledgments===

The research described in this paper was supported by the Technology Agency of the Czech Republic (project no. TA04010199).

===References===

[1] Chong, T. Y., Banchs, R. E., Chng, E. S., Li, H.: TDTO Language Modeling with Feedforward Neural Networks. In Proc. of Interspeech 2015, Dresden, Germany, p. 1458-1462, 2015.

[2] Loof, J., Gollan, C., Ney, H.: Cross-language bootstrapping for unsupervised acoustic model training: Rapid development of a Polish speech recognition system. In Proc. of Interspeech 2009, UK, p. 88-91, 2009.

[3] Vasserman, L., Schogol, V., Hall, K.: Sequence-based Class Tagging for Robust Transcription in ASR. In Interspeech 2015, p. 473-477, 2015.

[4] Nouza, J., Cerva, P., Kucharova, M.: Cost-Efficient Development of Acoustic Models for Speech Recognition of Related Languages. In Radioengineering, vol. 22, no. 3, p. 866-873, ISSN 1210-2512, 2013.

[5] Cerva, P., Nouza, J., Silovsky, J.: Study on Cross-Lingual Adaptation of a Czech LVCSR System towards Slovak. Springer Verlag, Vol. 6800, p. 81-87, 2011.

[6] Nouza, J., Cerva, P., Zdansky, J., Kucharova, M.: A study on adapting Czech automatic speech recognition system to Croatian language. In Proc. of Elmar 2012, Zadar (Croatia), p. 227-230, 2012.

[7] Nouza, J., Cerva, P., Safarik, R.: Cross-Lingual Adaptation of Broadcast Transcription System to Polish Language Using Public Data Sources. In LTC 2015, Poland, p. 181-185, ISBN 978-83-932640-8-7, 2015.
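
As an illustration of the rules in Sections 3.1 and 3.5, the following is a minimal sketch of how the patterns for the numbers 21-99 (DU, D_U, D_&_U, U&D) and the selection of a word form under 3F, 2F or 1F could be expressed. It is not the implementation of the tool described in the paper: the function names `join_decade_unit` and `select_form` are hypothetical, the form selection uses only the simplified main rule stated in the text and ignores the language-specific exceptions, and the SK, HR and SL number words in the usage lines are illustrative examples that do not come from the paper.

```python
from typing import Sequence


def join_decade_unit(decade: str, unit: str, pattern: str, conj: str = "") -> str:
    """Combine a decade word and a unit word for 21-99.

    `pattern` is one of the pattern names from Table 1; `conj` is the
    language-specific word for 'and' (needed for D_&_U and U&D).
    Hypothetical helper, not part of the original tool.
    """
    if pattern == "DU":        # decade and unit written as one word
        return decade + unit
    if pattern == "D_U":       # decade, space, unit
        return f"{decade} {unit}"
    if pattern == "D_&_U":     # decade, 'and', unit, separated by spaces
        return f"{decade} {conj} {unit}"
    if pattern == "U&D":       # unit, 'and', decade joined into one word
        return unit + conj + decade
    raise ValueError(f"unknown pattern: {pattern}")


def select_form(count: int, forms: Sequence[str]) -> str:
    """Pick the word form of a scale name or unit for a given count.

    One form means 1F, two forms mean 2F, three forms mean 3F.
    Simplified main rule only (assumption): 1 -> first form,
    2-4 -> second form (3F), everything else -> last form.
    """
    if len(forms) == 1:                      # 1F: no declension
        return forms[0]
    if len(forms) == 2:                      # 2F: singular / plural
        return forms[0] if count == 1 else forms[1]
    if len(forms) == 3:                      # 3F: 1 / 2-4 / more than 4
        if count == 1:
            return forms[0]
        if 2 <= count <= 4:
            return forms[1]
        return forms[2]
    raise ValueError("expected 1, 2 or 3 word forms")


if __name__ == "__main__":
    # 21 in four languages, one per pattern
    print(join_decade_unit("dvacet", "jedna", "D_U"))          # CZ
    print(join_decade_unit("dvadsať", "jeden", "DU"))          # SK
    print(join_decade_unit("dvadeset", "jedan", "D_&_U", "i")) # HR
    print(join_decade_unit("dvajset", "ena", "U&D", "in"))     # SL
    # CZ 'million' under 3F, using the forms quoted in Section 3.1
    cz_million = ("milion", "miliony", "milionů")
    for n in (1, 3, 5):
        print(n, select_form(n, cz_million))
```

Keeping the pattern name and the word forms as data, rather than hard-coding them per language, mirrors the idea in Section 4 that only the parameters and the lexicon change between languages.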