=Paper=
{{Paper
|id=Vol-2604/paper32
|storemode=property
|title=Written Form Extraction of Spoken Numeric Sequences in Speech-to-Text Conversion for Ukrainian
|pdfUrl=https://ceur-ws.org/Vol-2604/paper32.pdf
|volume=Vol-2604
|authors=Mykola Sazhok,Valentyna Robeiko,Ruslan Seliukh,Dmytro Fedoryn,Oleksandr Yukhymenko
|dblpUrl=https://dblp.org/rec/conf/colins/SazhokRSFY20
}}
==Written Form Extraction of Spoken Numeric Sequences in Speech-to-Text Conversion for Ukrainian==
Written Form Extraction of Spoken Numeric Sequences in Speech-to-Text Conversion for Ukrainian

Mykola Sazhok1[0000-0003-1169-6851], Valentyna Robeiko2[0000-0003-2266-7650], Ruslan Seliukh1[0000-0003-2230-8746], Dmytro Fedoryn1[0000-0002-4924-225X] and Oleksandr Yukhymenko1[0000-0001-5868-8547]

1 International Research/Training Center for Information Technology and Systems, Kyiv, Ukraine
2 Taras Shevchenko National University, Kyiv, Ukraine
{sazhok, valya.robeiko, vxml12, dmytro.fedoryn, enomaj}@gmail.com

Abstract. The result of automatic speech-to-text conversion is a sequence of words contained in a working dictionary. Hence each number would have to be added to the dictionary, which is not feasible. Therefore we introduce a post-processor block that extracts numeric sequences from the speech recognition response. We describe a sequence-to-sequence converter, a finite state transducer initially designed to generate phoneme sequences from words for Ukrainian using expert-specified rules. We then apply this model to extract numeric sequences from the speech recognition response, considering word sequences as well as time and speaker identity estimations for each word. Finally, we discuss experimental results and spot detected problems for further research.

Keywords: numeric sequence extraction, speech recognition post-processing, finite state transducer, rule-based conversions

1 Introduction

Human speech contains, depending on the domain, a significant amount of numeric sequences, which express cardinal numbers, time and dates, addresses, currency expressions and so on. A speech-to-text system produces a sequence of items that are, typically, words contained in the system's dictionary. The system's productivity depends on the dictionary size: taking more space and computational resources, a larger vocabulary induces additional hypotheses, which is a source of error increase.
If we consider each number as a valid word, the vocabulary expands with as many entries as there are numbers that might be expressed. Covering the numbers between 1 and 1000000 would therefore hypothetically mean that at least a million words must be introduced into the vocabulary. For highly inflective languages like Ukrainian, this amount is multiplied by the mean number of word forms. Moreover, most of these numbers are unseen by the component of an ASR model that constrains hypothetical word orders, so the data sparsity grows drastically.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

On the other hand, quite a limited sub-vocabulary of lemmas (stems) is used to compose a spoken numeric expression. For Ukrainian, 20 lemmas are sufficient to compose the spoken numbers from 0 to 19, nine stems are used for tens, nine stems are used to express hundreds and, finally, several lemmas name greater digit groups like thousand, million and billion. Therefore, in speech-to-text output all numbers are spelled out as word sequences, and recovering their numeric form from those sequences looks like a productive way forward.

Recent works aim to minimize the supervision, which varies greatly depending on the specific language [1,2,3]. Models using an end-to-end recurrent neural network are effective for English, for example; however, as reported, they do make errors with respect to the numeric value of the expression for highly inflective languages. Even such extremely rare cases would mislead about the message being conveyed, which is completely unacceptable. The second type of models uses finite-state transducers constructed with a minimal amount of training data per inflectional form, on average, which is crucial for highly inflective languages like Ukrainian.
The referred approaches intensively exploit the number verbalization provided by a text-to-speech system and, for the end-to-end model, a huge amount of synthesized speech. That is the price paid to minimize the supervision: it requires huge computational resources and is not applicable to the matured and generally more productive HMM/DNN approach [4]. Instead, we retain a reasonable amount of supervision for tuning the finite-state automata based on [5] and use widely available language knowledge. This work reports the current state of the research applied to Ukrainian.

2 Selection of Hypothetical Numeric Subsequences

In general, we consider a recognition response that includes, besides a word sequence, estimations of the beginning and duration of each recognized word as well as speaker diarization labels. Therefore, we may avoid including speech and speaker disruptions into hypothetical numeric word sequences, since a long pause between speech segments, as well as a speaker change, likely cuts a numeric sequence. In particular, our assumption is that a speaker never continues pronouncing a number started by the previous speaker.

Each word is assigned a numeric label, a generic label, or both. In Table 1 we can see a sequence of 10 words, (w1, w2, …, w10), recognized at the beginning of a real news episode. The first word, meaning "eighteenth", starts at 15.08 s and its duration is 0.65 s, as estimated by the speech-to-text converter. The second word is ambiguous and means either a number or an inflectional form of the word "magpie". As one can see, in Ukrainian several number words are homographs. Among them are also certain forms of the words meaning two, three and five. In this work we label such words as numeric and include them in hypothetically numeric subsequences. Hence, we selected two numeric word subsequences, (w1, w2, w3) and (w7, w8, w9).
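The selection procedure above can be sketched directly. The `Word` record, its field names, and the 0.5 s pause threshold are assumptions made for this illustration; the paper does not specify a concrete data layout or threshold.

```python
# Sketch of subsequence selection: consecutive numeric-labelled words
# are grouped; a group is cut at a non-numeric word, a speaker change,
# or a long inter-word pause.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float     # seconds, estimated by the recognizer
    duration: float
    speaker: str     # diarization label
    numeric: bool    # word carries a numeric label

def numeric_subsequences(words, max_pause=0.5):
    groups, current = [], []
    for w in words:
        prev = current[-1] if current else None
        cut = (
            not w.numeric
            or (prev is not None and w.speaker != prev.speaker)
            or (prev is not None
                and w.start - (prev.start + prev.duration) > max_pause)
        )
        if cut and current:          # close the running group
            groups.append(current)
            current = []
        if w.numeric:
            current.append(w)
    if current:
        groups.append(current)
    return groups

words = [Word("вісімнадцята", 15.08, 0.65, "spk1", True),
         Word("сорок",        15.74, 0.19, "spk1", True),
         Word("п'ять",        15.95, 0.31, "spk1", True),
         Word("факти",        16.26, 0.33, "spk1", False),
         Word("дві",          17.34, 0.18, "spk1", True)]
groups = numeric_subsequences(words)
print([[w.text for w in g] for g in groups])  # two groups, as in Table 1
```

A speaker change cuts a group even between two numeric words, matching the assumption that a speaker never continues a number started by the previous speaker.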
From the first subsequence we intend to extract the numbers 18 and 45, whereas the second subsequence contains just one number, 2019.

Table 1. Numeric sequence extraction sample analysis.

No  Start  Duration  Word in Ukrainian  Explanation in English  Labels            Intended output
1   15.08  0.65      вісімнадцята       eighteenth              numeric           18
2   15.74  0.19      сорок              forty; of magpies       numeric; generic  45
3   15.95  0.31      п'ять              five                    numeric
4   16.26  0.33      факти              "Facts"                 generic           факти
5   16.61  0.48      відкриває          is opening              generic           відкриває
6   17.11  0.23      новий              a new                   generic           новий
7   17.34  0.18      дві                two                     numeric           2019
8   17.52  0.36      тисячі             thousand                numeric
9   17.90  0.73      дев'ятнадцятий     nineteenth              numeric
10  18.63  0.21      рік                year                    generic           рік

3 Rule-Based Sequence-to-Sequence Multidecision Conversion Model

A key issue in modeling the conversion between sequences is how we define the correspondence between elements of the source and target sequences. We consider a finite sequence of source elements a_1^N = (a_1, a_2, ..., a_n, ..., a_N), where each element is taken from the set of input elements A. Let us construct the conversion of this sequence to a set of sequences of output elements taken from B.

Consider an elementary correspondence f that maps a subsequence of a_1^N, starting from its n-th element, to an element of the set B or an empty element:

f(a_n^N) = b,  a_n^N ∈ Def(f) ⊂ A,  b ∈ B ∪ {∅},  1 ≤ n ≤ N.  (1)

Note that (1) is applicable only to the specified source sequences. Applying sequences of such functions f to the source subsequence a_n^N, we attain a set of target subsequences:

F(a_n^N) = { (f_1^k(a_n^N), f_2^k(a_n^N), ..., f_{L_k}^k(a_n^N)) ∈ (B ∪ {∅})^{L_k},  1 ≤ k ≤ K_F }.  (2)

Here L_k is the length of the k-th target subsequence and K_F is the number of target subsequences. The correspondences (2) introduced this way form the set F.
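A minimal sketch of the multidecision conversion in (1)-(2): each rule consumes a prefix of the source subsequence and emits one target element; exploring every applicable rule at every position enumerates all K_F target subsequences. The rules and the digit-position codes (d for tens, e for units, in the spirit of Table 3) are illustrative assumptions, not the paper's actual rule set.

```python
# Multidecision conversion sketch for (1)-(2): a rule maps a source
# prefix to one target element and an analysis step; applying every
# matching rule at each position yields all target subsequences.
RULES = [
    (("сорок",), ["4d"], 1),
    (("п'ять",), ["5e"], 1),
    (("сорок",), ["сорок"], 1),  # generic reading -> an extra decision
]

def convert(source):
    """Return every target subsequence reachable from `source`."""
    if not source:
        return [[]]              # one empty target for the empty source
    results = []
    for pattern, output, step in RULES:
        if tuple(source[:len(pattern)]) == pattern:
            for tail in convert(source[step:]):
                results.append(output + tail)
    return results

print(convert(["сорок", "п'ять"]))
# -> [['4d', '5e'], ['сорок', '5e']]
```

The recursion over tails realizes the concatenation defined in (3): every head decision combines with every tail decision, and a position with no matching rule yields an empty result set that propagates.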
Now we define an operation that concatenates over the sets produced by F and G taken from F, as all possible combinations of target sequences generated by F followed by G:

F ∘ G = { (f_1^u, f_2^u, ..., f_{L_u}^u, g_1^v, g_2^v, ..., g_{L_v}^v),  1 ≤ u ≤ K_F,  1 ≤ v ≤ K_G }.  (3)

Additionally, we assume that the concatenation result is empty if at least one of F and G is empty. Further, we specify ordered correspondences (2) and accomplish them with additional parameters, attaining a set:

F̃ = { (F_i, d, δ) },  F_i ∈ F,  1 ≤ i ≤ |F̃|,  (4)

where d is the analysis step and δ denotes exclusivity.

Each rule is specified with the following template, where square brackets mark optional components and angle brackets mark placeholders:

<source subsequence> : <step> <target subsequence> [ <step> <target subsequence> ... ]

The block terminated with an ellipsis might be repeated with different values, which will induce multiple decisions. We will refer to the examples in Table 3 for rule specification illustrations.

The source subsequence consists of explicit characters as well as wildcards replacing any one symbol, ?, or one or more symbols, *, as in examples 1 through 3 in Table 3. Furthermore, the expert may define a subset of characters by taking them in square brackets (examples 4 and 5). Sequence elements are separated by whitespace. The only extension is the exclusivity introduced in (4): it is denoted as -x and used in the pattern that maps an unspecified element to itself, as in example 6. The following pairs of parameters may be repeated as many times as there are alternative conversions valid for the source subsequence. The step value stands for the analysis step introduced in (4), and the target subsequence describes how to generate the output. The wildcard ?, as in examples 4 and 5, stands for the actually matching character; i.e., for example 5, a match of the source pattern "M 2 c |" will be mapped to "M 2 c 0 d 0 e |".

Table 3. Examples for the rule specification.
No  Source subsequence      Step  Target subsequence  Explanation
1   тисяч*                  1     t                   words starting with "тисяч", a form of "thousand", excluding "тисяч" itself, map to t
2   п'ятис*                 1     5c                  words starting with "п'ятис", a form of "five hundred", map to 5c
3   дванадцят*              1     1d2e                words starting with "дванадцят", a form of "twelve", map to 1d2e
4   [TBM|] t                1     ?1e                 inserting a one, 1, before a thousand that was not pronounced
5   [0123456789] c [TBMt|]  3     ??c0d0e             zeroing a skipped running pair d and e: inserts 0d0e after c if c is followed by an element denoting a digit group boundary
6   *                       1     *                   any non-empty source element is mapped to itself (if no rules have been applied)

6 Implementation

The module that provides word-to-number extraction in accordance with Section 4 is written in Perl and derived from the implementation of bidirectional text-to-pronunciation conversion [5]. The rules are specified as described in Section 5, one level per file. The file that corresponds to the next level is indicated in the header.

The basic implementation is deployed online [10] and may be tested alongside other rule-based sequence-to-sequence conversions.

For the experiments, the data is read and written in the time-marked conversation (ctm) file format. In Table 3 the aligned input and output lines are presented for a real broadcast transcript produced by the automatic speech recognition system for Ukrainian broadcast media transcribing [7]. The numbers, indicated in bold, are extracted as expected.

7 Conclusions

The described multilevel rule-based system allows for generating hypotheses of word-to-number conversion. Best hypothesis selection is the subject of analysis of lexical, syntactic and prosodic contexts over large corpora.
To introduce a new language, an expert just needs to fill in the language-dependent rules mapping to a language-independent number spelling presentation, as illustrated in Table 2, rows 1 through 3. This way multilingual content might be introduced as well.

Further modeling will include appending a suffix for ordinal numbers and extraction of fractions, compound words (like "20-year-old"), time, sport scores and other numerical types.

Table 3. Input and output ctm-file comparison.

Input sequence:
Start  Duration  Word            Explanation in English
15.08  0.65      Вісімнадцята    Eighteenth
15.74  0.19      сорок           forty
15.95  0.31      п'ять           five
16.26  0.33      факти           "Facts"
16.61  0.48      відкриває       is opening
17.11  0.23      новий           a new
17.34  0.18      дві             two
17.52  0.36      тисячі          of a thousand
17.90  0.73      дев'ятнадцятий  nineteenth
18.63  0.21      рік.            year.
18.84  0.45      Наступні        Next
19.29  0.51      півгодини       half an hour
19.80  0.15      про             about
19.95  0.60      найголовніші    most important
20.55  0.33      події           events
20.88  0.33      другого         of the second
21.21  0.42      січня           January

Output sequence:
Start  Duration  Word
15.08  0.65      18
15.74  0.52      45
16.26  0.33      факти
16.61  0.48      відкриває
17.11  0.23      новий
17.34  1.29      2019
18.63  0.21      рік.
18.84  0.45      Наступні
19.29  0.51      півгодини
19.80  0.15      про
19.95  0.60      найголовніші
20.55  0.33      події
20.88  0.33      2
21.21  0.42      січня

References

1. Gorman, K., Sproat, R.: Minimally supervised number normalization. Transactions of the Association for Computational Linguistics 4, 507–519 (2016).
2. He, Y. et al.: Streaming End-to-end Speech Recognition for Mobile Devices. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385. Brighton, United Kingdom (2019).
3. Peyser, C., Zhang, H., Sainath, T.N., Wu, Z.: Improving Performance of End-to-End ASR on Numeric Sequences. In: Interspeech 2019 Proceedings, pp. 2185–2189 (2019).
4. Hinton, G., Deng, L., Yu, D., Dahl, G. et al.: Deep Neural Networks for Acoustic Modeling in Speech Recognition.
Signal Processing Magazine, IEEE 6 (29), 82–97 (2012).
5. Robeiko, V., Sazhok, M.: Bidirectional Text-To-Pronunciation Conversion with Word Stress Prediction for Ukrainian. In: 11th All-Ukrainian International Conference on Signal/Image Processing and Pattern Recognition UkrObraz'2012, pp. 43–46. UAsIPPR, Kyiv, Ukraine (2012).
6. Shirokov, V., Manako, V.: Organization of resources for the national dictionary base. Movoznavstvo 5, 3–13 (2001).
7. Sazhok, M., Selyukh, R., Fedoryn, D., Yukhymenko, O., Robeiko, V.: Automatic speech recognition for Ukrainian broadcast media transcribing. Control Systems and Computers 6 (264), 46–57 (2019).
8. Povey, D., Ghoshal, A., Boulianne, G. et al.: The Kaldi Speech Recognition Toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (2011).
9. Zewoudie, A., Luque, J., Hernando, J.: The use of long-term features for GMM- and i-vector-based speaker diarization systems. EURASIP Journal on Audio, Speech, and Music Processing, 14 (2018).
10. Bidirectional text-to-pronunciation conversion tool, www.cybermova.com/labs, last accessed 2020/02/20.
11. Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., Mohri, M.: OpenFst: A General and Efficient Weighted Finite-State Transducer Library. In: Holub, J., Žďárek, J. (eds) Implementation and Application of Automata. CIAA 2007. Lecture Notes in Computer Science, vol 4783. Springer, Berlin, Heidelberg (2007).