=Paper=
{{Paper
|id=Vol-2604/paper32
|storemode=property
|title=Written Form Extraction of Spoken Numeric Sequences in Speech-to-Text Conversion for Ukrainian
|pdfUrl=https://ceur-ws.org/Vol-2604/paper32.pdf
|volume=Vol-2604
|authors=Mykola Sazhok,Valentyna Robeiko,Ruslan Seliukh,Dmytro Fedoryn,Oleksandr Yukhymenko
|dblpUrl=https://dblp.org/rec/conf/colins/SazhokRSFY20
}}
==Written Form Extraction of Spoken Numeric Sequences in Speech-to-Text Conversion for Ukrainian==
Written Form Extraction of Spoken Numeric Sequences in Speech-to-Text Conversion for Ukrainian

Mykola Sazhok1[0000-0003-1169-6851], Valentyna Robeiko2[0000-0003-2266-7650], Ruslan Seliukh1[0000-0003-2230-8746], Dmytro Fedoryn1[0000-0002-4924-225X] and Oleksandr Yukhymenko1[0000-0001-5868-8547]

1 International Research/Training Center for Information Technology and Systems, Kyiv, Ukraine
2 Taras Shevchenko National University, Kyiv, Ukraine
{sazhok, valya.robeiko, vxml12, dmytro.fedoryn, enomaj}@gmail.com

Abstract. The result of automatic speech-to-text conversion is a sequence of words contained in a working dictionary. Hence each number would have to be added to the dictionary, which is not feasible. Therefore we introduce a post-processor block that extracts numeric sequences from the speech recognition response. We describe a sequence-to-sequence converter, a finite state transducer initially designed to generate phoneme sequences from words for Ukrainian using expert-specified rules. We then apply this model to extract numeric sequences from the speech recognition response, considering word sequences as well as time and speaker identity estimations for each word. Finally, we discuss experimental results and spot detected problems for further research.

Keywords: numeric sequence extraction, speech recognition post-processing, finite state transducer, rule-based conversions

1 Introduction

Human speech contains, depending on the domain, a significant amount of numeric sequences, which express cardinal numbers, time and dates, addresses, currency expressions and so on. A speech-to-text system produces a sequence of items that are, typically, words contained in the system's dictionary. The system's productivity depends on the dictionary size: taking more space and computational resources, a larger vocabulary induces additional hypotheses, which is a source of error increase.
If we consider each number as a valid word, the vocabulary expands with as many entries as there are numbers that might be expressed. Covering the numbers between 1 and 1000000 would therefore hypothetically mean that at least a million words must be introduced into the vocabulary. For highly inflective languages like Ukrainian, this amount is multiplied by the mean number of word forms. Moreover, most of these numbers are unseen by the component of an ASR model that constrains hypothetical word orders, so the data sparsity grows drastically.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

On the other hand, quite a limited sub-vocabulary of lemmas (stems) is used to compose a spoken numeric expression. For Ukrainian, 20 lemmas are sufficient to compose the spoken numbers from 0 to 19, nine stems are used for tens, nine stems are used to express hundreds and, finally, several lemmas name greater digit groups like thousand, million and billion. Therefore, in speech-to-text output all numbers are spelled out as word sequences, and recovering their numeric form from those sequences looks like a productive way forward.

Recent works aim to minimize the supervision, which varies greatly depending on the specific language [1,2,3]. Models using an end-to-end recurrent neural network are effective for English, for example; however, as reported, they do make errors with respect to the numeric value of the expression for highly inflective languages. Even such extremely rare cases would mislead about the message being conveyed, which is completely unacceptable. The second type of models uses finite-state transducers constructed with a minimal amount of training data per inflectional form, on average, which is crucial for highly inflective languages like Ukrainian.
The referred approaches intensively exploit the number verbalization provided by a text-to-speech system and, for the end-to-end model, a huge amount of synthesized speech. That is the price paid to minimize the supervision: it requires huge computational resources and is not applicable to the matured and generally more productive HMM/DNN approach [4]. Instead, we retain a reasonable amount of supervision for tuning the finite-state automata based on [5] and use widely available language knowledge. This work reports the current state of the research applied to Ukrainian.

2 Selection of Hypothetical Numeric Subsequences

In general, we consider a recognition response that includes, besides a word sequence, estimations of the beginning and duration of each recognized word as well as speaker diarization labels. Therefore, we may avoid including speech and speaker disruptions into hypothetical numeric word sequences, since a long pause between speech segments, as well as a speaker change, likely cuts a numeric sequence. In particular, our assumption is that a speaker never continues pronouncing a number started by the previous speaker.

Each word is assigned a numeric label, a generic label, or both. In Table 1 we can see a sequence of 10 words, (w1, w2, …, w10), recognized at the beginning of a real news episode. The first word, meaning "eighteenth", starts at 15.08 s and its duration is 0.65 s, as estimated by the speech-to-text converter. The second word is ambiguous and means either a number or an inflectional form of the word "magpie". As one can see, in Ukrainian several number words are homographs. Among them are also certain forms of the words meaning two, three and five. In this work we label such words as numeric and include them in hypothetically numeric subsequences. Hence, we selected two numeric word subsequences, (w1, w2, w3) and (w7, w8, w9).
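The selection procedure above can be sketched directly. The `Word` record, its field names, and the 0.5 s pause threshold are assumptions made for this illustration; the paper does not specify a concrete data layout or threshold.

```python
# Sketch of subsequence selection: consecutive numeric-labelled words
# are grouped; a group is cut at a non-numeric word, a speaker change,
# or a long inter-word pause.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float     # seconds, estimated by the recognizer
    duration: float
    speaker: str     # diarization label
    numeric: bool    # word carries a numeric label

def numeric_subsequences(words, max_pause=0.5):
    groups, current = [], []
    for w in words:
        prev = current[-1] if current else None
        cut = (
            not w.numeric
            or (prev is not None and w.speaker != prev.speaker)
            or (prev is not None
                and w.start - (prev.start + prev.duration) > max_pause)
        )
        if cut and current:          # close the running group
            groups.append(current)
            current = []
        if w.numeric:
            current.append(w)
    if current:
        groups.append(current)
    return groups

words = [Word("вісімнадцята", 15.08, 0.65, "spk1", True),
         Word("сорок",        15.74, 0.19, "spk1", True),
         Word("п'ять",        15.95, 0.31, "spk1", True),
         Word("факти",        16.26, 0.33, "spk1", False),
         Word("дві",          17.34, 0.18, "spk1", True)]
groups = numeric_subsequences(words)
print([[w.text for w in g] for g in groups])  # two groups, as in Table 1
```

A speaker change cuts a group even between two numeric words, matching the assumption that a speaker never continues a number started by the previous speaker.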
From the first subsequence we intend to extract the numbers 18 and 45, whereas the second subsequence contains just one number, 2019.

Table 1. Numeric sequence extraction sample analysis.

No  Start  Duration  Word in Ukrainian  Explanation in English  Labels            Intended output
1   15.08  0.65      вісімнадцята       eighteenth              numeric           18
2   15.74  0.19      сорок              forty; of magpies       numeric; generic  45
3   15.95  0.31      п'ять              five                    numeric
4   16.26  0.33      факти              "Facts"                 generic           факти
5   16.61  0.48      відкриває          is opening              generic           відкриває
6   17.11  0.23      новий              a new                   generic           новий
7   17.34  0.18      дві                two                     numeric           2019
8   17.52  0.36      тисячі             thousand                numeric
9   17.90  0.73      дев'ятнадцятий     nineteenth              numeric
10  18.63  0.21      рік                year                    generic           рік

3 Rule-Based Sequence-to-Sequence Multidecision Conversion Model

A key issue in modeling the conversion between sequences is how we define the correspondence between elements of the source and target sequences. We consider a finite sequence of source elements a_1^N = (a_1, a_2, ..., a_n, ..., a_N), where each element is taken from the set of input elements A. Let us construct the conversion of this sequence to a set of sequences of output elements taken from B.

Consider an elementary correspondence f that maps a subsequence of a_1^N, starting from its n-th element, to an element of the set B or an empty element:

f(a_n^N) = b,  a_n^N ∈ Def(f) ⊂ A,  b ∈ B ∪ {∅},  1 ≤ n ≤ N.  (1)

Note that (1) is applicable only to the specified source sequences. Applying sequences of such functions f to the source subsequence a_n^N, we attain a set of target subsequences:

F(a_n^N) = { (f_1^k(a_n^N), f_2^k(a_n^N), ..., f_{L_k}^k(a_n^N)) ∈ (B ∪ {∅})^{L_k},  1 ≤ k ≤ K_F }.  (2)

Here L_k is the length of the k-th target subsequence and K_F is the number of target subsequences. The correspondences (2) introduced this way form the set F.
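A minimal sketch of the multidecision conversion in (1)-(2): each rule consumes a prefix of the source subsequence and emits one target element; exploring every applicable rule at every position enumerates all K_F target subsequences. The rules and the digit-position codes (d for tens, e for units, in the spirit of Table 3) are illustrative assumptions, not the paper's actual rule set.

```python
# Multidecision conversion sketch for (1)-(2): a rule maps a source
# prefix to one target element and an analysis step; applying every
# matching rule at each position yields all target subsequences.
RULES = [
    (("сорок",), ["4d"], 1),
    (("п'ять",), ["5e"], 1),
    (("сорок",), ["сорок"], 1),  # generic reading -> an extra decision
]

def convert(source):
    """Return every target subsequence reachable from `source`."""
    if not source:
        return [[]]              # one empty target for the empty source
    results = []
    for pattern, output, step in RULES:
        if tuple(source[:len(pattern)]) == pattern:
            for tail in convert(source[step:]):
                results.append(output + tail)
    return results

print(convert(["сорок", "п'ять"]))
# -> [['4d', '5e'], ['сорок', '5e']]
```

The recursion over tails realizes the concatenation defined in (3): every head decision combines with every tail decision, and a position with no matching rule yields an empty result set that propagates.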
Now we define an operation that concatenates over the sets produced by F and G taken from F, as all possible combinations of target sequences generated by F followed by G:

F ∘ G = { (f_1^u, f_2^u, ..., f_{L_u}^u, g_1^v, g_2^v, ..., g_{L_v}^v),  1 ≤ u ≤ K_F,  1 ≤ v ≤ K_G }.  (3)

Additionally, we assume that the concatenation result is empty if at least one of F and G is empty. Further, we specify ordered correspondences (2) and accomplish them with additional parameters, attaining a set:

F̃ = { (F_i, d, δ) },  F_i ∈ F,  1 ≤ i ≤ |F̃|,  (4)

where d is the analysis step and δ denotes exclusivity.

Each rule is specified with the following template, where square brackets mark optional components and angle brackets mark placeholders:

<source subsequence> : <step> <target subsequence> [ <step> <target subsequence> ... ]

The block terminated with an ellipsis might be repeated with different values, which will induce multiple decisions. We will refer to the examples in Table 3 for rule specification illustrations.

The source subsequence consists of explicit characters as well as wildcards replacing any one symbol, ?, or one or more symbols, *, as in examples 1 through 3 in Table 3. Furthermore, the expert may define a subset of characters by taking them in square brackets (examples 4 and 5). Sequence elements are separated by whitespace. The only extension is the exclusivity introduced in (4): it is denoted as -x and used in the pattern that maps an unspecified element to itself, as in example 6. The following pairs of parameters may be repeated as many times as there are alternative conversions valid for the source subsequence. The step value stands for the analysis step introduced in (4), and the target subsequence describes how to generate the output. The wildcard ?, as in examples 4 and 5, stands for the actually matching character; i.e., for example 5, a match of the source pattern "M 2 c |" will be mapped to "M 2 c 0 d 0 e |".

Table 3. Examples for the rule specification.
No  Source subsequence      Step  Target subsequence  Explanation
1   тисяч*                  1     t                   words starting with "тисяч", a form of "thousand", excluding "тисяч" itself, map to t
2   п'ятис*                 1     5c                  words starting with "п'ятис", a form of "five hundred", map to 5c
3   дванадцят*              1     1d2e                words starting with "дванадцят", a form of "twelve", map to 1d2e
4   [TBM|] t                1     ?1e                 inserting a one, 1, before a thousand that was not pronounced
5   [0123456789] c [TBMt|]  3     ??c0d0e             zeroing a skipped running pair d and e: inserts 0d0e after c if c is followed by an element denoting a digit group boundary
6   *                       1     *                   any non-empty source element is mapped to itself (if no rules have been applied)

6 Implementation

The module that provides word-to-number extraction in accordance with Section 4 is written in Perl and derived from the implementation of bidirectional text-to-pronunciation conversion [5]. The rules are specified as described in Section 5, one level per file. The file that corresponds to the next level is indicated in the header.

The basic implementation is deployed online [10] and may be tested alongside other rule-based sequence-to-sequence conversions.

For the experiments, the data is read and written in the time-marked conversation (ctm) file format. In Table 3 the aligned input and output lines are presented for a real broadcast transcript produced by the automatic speech recognition system for Ukrainian broadcast media transcribing [7]. The numbers, indicated in bold, are extracted as expected.

7 Conclusions

The described multilevel rule-based system allows for generating hypotheses of word-to-number conversion. Best hypothesis selection is the subject of analysis of lexical, syntactic and prosodic contexts over large corpora.
To introduce a new language, an expert just needs to fill in the language-dependent rules mapping to a language-independent number spelling presentation, as illustrated in Table 2, rows 1 through 3. This way multilingual content might be introduced as well.

Further modeling will include appending a suffix for ordinal numbers and extraction of fractions, compound words (like "20-year-old"), time, sport scores and other numerical types.

Table 3. Input and output ctm-file comparison.

Input sequence:
Start  Duration  Word            Explanation in English
15.08  0.65      Вісімнадцята    Eighteenth
15.74  0.19      сорок           forty
15.95  0.31      п'ять           five
16.26  0.33      факти           "Facts"
16.61  0.48      відкриває       is opening
17.11  0.23      новий           a new
17.34  0.18      дві             two
17.52  0.36      тисячі          of a thousand
17.90  0.73      дев'ятнадцятий  nineteenth
18.63  0.21      рік.            year.
18.84  0.45      Наступні        Next
19.29  0.51      півгодини       half an hour
19.80  0.15      про             about
19.95  0.60      найголовніші    most important
20.55  0.33      події           events
20.88  0.33      другого         of the second
21.21  0.42      січня           January

Output sequence:
Start  Duration  Word
15.08  0.65      18
15.74  0.52      45
16.26  0.33      факти
16.61  0.48      відкриває
17.11  0.23      новий
17.34  1.29      2019
18.63  0.21      рік.
18.84  0.45      Наступні
19.29  0.51      півгодини
19.80  0.15      про
19.95  0.60      найголовніші
20.55  0.33      події
20.88  0.33      2
21.21  0.42      січня

References

1. Gorman, K., Sproat, R.: Minimally supervised number normalization. Transactions of the Association for Computational Linguistics 4, 507–519 (2016).
2. He, Y. et al.: Streaming End-to-end Speech Recognition for Mobile Devices. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385. Brighton, United Kingdom (2019).
3. Peyser, C., Zhang, H., Sainath, T.N., Wu, Z.: Improving Performance of End-to-End ASR on Numeric Sequences. In: Interspeech 2019 Proceedings, pp. 2185–2189 (2019).
4. Hinton, G., Deng, L., Yu, D., Dahl, G. et al.: Deep Neural Networks for Acoustic Modeling in Speech Recognition.
Signal Processing Magazine, IEEE 6 (29), 82–97 (2012).
5. Robeiko, V., Sazhok, M.: Bidirectional Text-To-Pronunciation Conversion with Word Stress Prediction for Ukrainian. In: 11th All-Ukrainian International Conference on Signal/Image Processing and Pattern Recognition UkrObraz'2012, pp. 43–46. UAsIPPR, Kyiv, Ukraine (2012).
6. Shirokov, V., Manako, V.: Organization of resources for the national dictionary base. Movoznavstvo 5, 3–13 (2001).
7. Sazhok, M., Selyukh, R., Fedoryn, D., Yukhymenko, O., Robeiko, V.: Automatic speech recognition for Ukrainian broadcast media transcribing. Control Systems and Computers 6 (264), 46–57 (2019).
8. Povey, D., Ghoshal, A., Boulianne, G. et al.: The Kaldi Speech Recognition Toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (2011).
9. Zewoudie, A., Luque, J., Hernando, J.: The use of long-term features for GMM- and i-vector-based speaker diarization systems. EURASIP Journal on Audio, Speech, and Music Processing, 14 (2018).
10. Bidirectional text-to-pronunciation conversion tool, www.cybermova.com/labs, last accessed 2020/02/20.
11. Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., Mohri, M.: OpenFst: A General and Efficient Weighted Finite-State Transducer Library. In: Holub, J., Žďárek, J. (eds) Implementation and Application of Automata. CIAA 2007. Lecture Notes in Computer Science, vol 4783. Springer, Berlin, Heidelberg (2007).