=Paper=
{{Paper
|id=Vol-3723/paper18
|storemode=property
|title=Topological structure of Ukrainian tongue twisters based on speech sound analysis
|pdfUrl=https://ceur-ws.org/Vol-3723/paper18.pdf
|volume=Vol-3723
|authors=Tetiana Kovaliuk,Iryna Yurchuk,Olga Gurnik
|dblpUrl=https://dblp.org/rec/conf/modast/KovaliukYG24
}}
==Topological structure of Ukrainian tongue twisters based on speech sound analysis==
Tetiana Kovaliuk1,†, Iryna Yurchuk2, ∗ ,† and Olga Gurnik3,†
1,2 Taras Shevchenko National University of Kyiv, Bohdan Hawrylyshyn str. 24, Kyiv, UA-04116, Ukraine
3 Separate Structural Unit “Vocational College of Engineering, Management and Land Management of National
Aviation University”, Metrobudivska str. 5-a, Kyiv, UA-03065, Ukraine
Abstract
Natural language processing occupies a central place at the current stage of the development of
artificial intelligence and of machine learning as its component. This is due not only to the fact
that the ability to conduct a meaningful dialogue is one of the simplest manifestations of human
intelligence, but also to the fact that there is currently an excessive amount of information in
social networks, news feeds, etc., which requires automated processing with a specific goal
(preventing terrorist activity, threats, the spread of fakes, etc.).
Models that can distinguish meanings, grasp the content of texts, continue dialogues and
understand the topic of a conversation are useful. Every language contains certain classes of
texts (poems, idioms, colloquialisms) that are more complex than ordinary narrative sentences
and require natural language processing algorithms to be trained more thoroughly.
In this work, the authors study tongue twisters to understand their sound composition and
structural features, paying special attention to their speech therapy orientation. The speech
sounds were classified by labialization, volume, hardness and softness, and place and method
of creation. A topological analysis of their structure was implemented; in particular, the Betti
numbers were calculated, and the obtained results were generalized.
Keywords
Ukrainian tongue twister, persistent homology, text vectorization
1. Introduction
For every language, tongue twisters as a speech genre are important. They are short,
syntactically correct phrases spoken without context, with especially complicated
articulation and combinations of sounds that involve different phonemes and are difficult to
pronounce. They are a way to develop the speech skills of children of preschool and primary
school age both for the purpose of improvement and for the therapeutic purpose of
eliminating defects. Public figures, actors and singers also use tongue twisters to improve
their skills and build confidence in speeches, performances and recitations.

MoDaST-2024: 6th International Workshop on Modern Data Science Technologies, May 31 – June 1, 2024, Lviv-Shatsk, Ukraine
∗ Corresponding author.
† These authors contributed equally.
tetyana.kovalyuk@gmail.com (T. Kovaliuk); i.a.yurchuk@gmail.com (I. Yurchuk); olga.gurnick@gmail.com (O. Gurnik)
ORCID: 0000-0002-1383-1589 (T. Kovaliuk); 0000-0001-8206-3395 (I. Yurchuk); 0009-0008-4186-3044 (O. Gurnik)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
Tongue twisters form a relatively small part of the language in terms of the number of
available texts, since they are often devoid of content and focus on the alternation of certain
sounds, or rather on the difficulty of their reproduction by the speech apparatus (tongue,
lips, etc.).
In previous work by I. Yurchuk and O. Gurnik [1], detection of tongue twisters in the
Ukrainian language using letter-based vectorization was implemented, and an average
detection rate of 80% was obtained. The main drawback of that work was that the
articulatory complexity of the sounds was not taken into account; only the letters that made
up the text were coded.
This work continues the study of Ukrainian tongue twisters, with an emphasis on their use
in speech therapy. For this purpose, a speech sound analysis of each tongue twister was
carried out: each speech sound was vectorized by mapping it into a seven-dimensional
space, after which a cloud of points was assigned to each tongue twister and investigated
using topological data analysis. In particular, Betti numbers were calculated for each tongue
twister, and the obtained values were analyzed.
The purpose of this work is to study the features of tongue twisters in terms of topological
invariants, for use by speech therapists who deal both with the elimination of speech
defects and with the general development of speech skills in people of any age (primary
school children, public figures, elderly people recovering from diseases affecting the brain).
The aim of the research is to propose topological structures whose construction is
informative for understanding the nature of a tongue twister, and to establish a dataset
whose integration into a machine learning method can support this understanding in
future research.
To achieve this purpose, the major research objectives are:
• To form a dataset of tongue twisters used by speech therapists and to carry out their
sound analysis.
• In accordance with speech therapy requirements, to form criteria and features for each
sound and to build a mapping into a real space of a certain dimension.
• To conduct a topological analysis of each tongue twister and analyze the obtained results.
It should be noted that this study is motivated by the lack of a dataset large enough to
guarantee high accuracy when applying machine learning methods directly.
2. Related works
Works related to the study of tongue twisters and their influence on speech and the
application of topological data analysis to language processing are considered.
In [2], the authors strengthened the basis for implementing prosodic strategies in speech
intervention, using tongue twisters with speakers (mean age 54.5 years) with spastic or
mixed-spastic dysarthria of varying etiology (cerebral palsy, multiple sclerosis, multiple
system atrophy).
Tongue twisters play an important role in detecting not only speech defects but also
physiological ones, in particular tumors. T. Bressmann, A. Foltz, J. Zimmermann, and J. C.
Irish [3] proposed outcome measures for affected speech production: the patients' speech
acceptability, rate of errors, the time needed to produce the tongue twisters, pause
duration between item repetitions and the tongue shape during production. These
measures helped to show that surgical resection of the tongue changed the error rate in
the speech of speakers with a partial glossectomy. To reproduce a tongue twister, the
speaker has to balance speed and accuracy; therefore the presence of a lingual tumor and
the subsequent glossectomy require a patient to allocate more resources to the
phonological planning of the tongue twister because of the structural alteration of the
tongue.
We have to remark that tongue twisters can be an effective instrument to research inner
speech which plays a key role in a variety of different cognitive activities, including writing,
personal thought, reasoning and memorization, see [4].
Among what has been implemented so far in language processing using topological data
analysis, we highlight the following works: a distance measure between poets' literary
styles [5]; an investigation of interpretable topological features of transformer-based
language models related to surface and structural properties [6]; persistence bag-of-words,
an analog of bag-of-words that provides a stable vectorized representation enabling
seamless integration with machine learning [7]; and text classification and visualization
[8-10].
3. Methods
It is known that there are several methods, or more precisely paradigms, for machine
processing of texts. Let us briefly review the main ones:
Neural networks: Recurrent Neural Networks (RNNs), Long Short-Term Memory
(LSTM) and Gated Recurrent Unit (GRU) are classes of neural networks that have
been developed specifically for processing sequential data such as text, audio, time
series, etc. The basic idea of recurrent neural networks is that they have the ability
to remember the previous state (information) and use it to process the next input in
the sequence. LSTM has additional internal structures (gates), and GRU has
mechanisms of forgetting and updating. The best choice between LSTM and GRU
depends on the size of the data and the specifics of the task. LSTM can be useful when
long-term memory is important, but it requires more resources to train. GRU is less
complex and faster to train, but may be less powerful for some problems.
Word2vec uses a neural network model to learn word associations from a large text
corpus. It can detect synonyms or suggest additional words for a partial sentence.
Transformers: BERT (Bidirectional Encoder Representations from Transformers) is
a deep learning model based on the Transformer architecture and used to solve
Natural Language Processing (NLP) problems. BERT is one of the most effective
models for context-based language understanding and has gained significant
popularity since its launch. Tasks that can be solved with BERT include text
classification, named entity recognition, question answering, and many other
natural language processing tasks. BERT has an impressive ability to understand
complex language constructions and semantics thanks to its ability to model context
in both directions.
Unsupervised learning algorithm: GloVe maps words into a meaningful space in which
the distance between words is related to semantic similarity. Training is performed on
aggregated global statistics of pairwise co-occurrence of corpus words, and the resulting
representations demonstrate interesting linear substructures of the word vector space.
We have to remark that all of the above approaches require large datasets and painstaking
work on their cleaning and labeling. Their common limitation is that the larger the sample
(dataset), the better the results, and the amount of training data is typically measured in
thousands of units. That is why the authors propose the approach described in this section.
In this section, vectorization of the words, dataset and main terms of persistent
homology are considered.
3.1. Principles of speech sound coding
Every speech sound 𝑥 corresponds to a vector 𝑥⃗ = (𝑥1, 𝑥2, 𝑥3, 𝑥4, 𝑥5, 𝑥6, 𝑥7), where:
𝑥1 is the ordinal number of the speech sound in the text.
𝑥2 is the ordinal number of the word in the text which contains the speech sound 𝑥.
𝑥3 equals 1 for labialized vowel speech sounds and 2 for non-labialized vowel speech
sounds. If a speech sound is a consonant, 𝑥3 equals zero.
𝑥4 encodes a consonant sound by volume: sonorous, voiced or voiceless. If a speech sound
is a vowel, 𝑥4 equals zero.
𝑥5 encodes a consonant sound by place of creation: labial, nasal, lingual or laryngeal. If a
speech sound is a vowel, 𝑥5 equals zero.
𝑥6 encodes a consonant sound by method of creation: closed (plosive) sounds are created
at the moment an air stream breaks through closed speech organs (they are also called
breakthrough, explosive or instantaneous, because their creation is fast and cannot be
prolonged); fricative sounds are made when a stream of exhaled air passes through a gap
between the speech organs (whistling and hissing; they can be lengthened or drawn out);
closed-through sounds combine moments of closure and breakthrough during their
creation; affricates (closed-cleft or merged); and trembling (vibrating) sounds. If a speech
sound is a vowel, 𝑥6 equals zero.
𝑥7 encodes a consonant sound by hardness and softness: hard, soft, softened (palatalized)
or semi-softened (semi-palatalized). If a speech sound is a vowel, 𝑥7 equals zero.
Let us consider an example of mapping the tongue twister “Yila Maryna malynu” into 𝑍7,
see Table 1. Every speech sound corresponds to a unique point in 𝑍7. Moreover, all
coordinates of a point are non-negative integers.
Table 1
A map of the tongue twister “Yila Maryna malynu” into 𝒁𝟕

Speech sound  𝑥1  𝑥2  𝑥3  𝑥4  𝑥5  𝑥6  𝑥7
y              1   1   0   2   3   2   2
i              2   1   2   0   0   0   0
l              3   1   0   1   3   2   1
a              4   1   2   0   0   0   0
m              5   2   0   1   2   3   1
a              6   2   2   0   0   0   0
r              7   2   0   1   3   5   1
y              8   2   2   0   0   0   0
n              9   2   0   1   2   3   1
a             10   2   2   0   0   0   0
m             11   3   0   1   2   3   1
a             12   3   2   0   0   0   0
l             13   3   0   1   3   2   1
y             14   3   2   0   0   0   0
n             15   3   0   1   2   3   1
u             16   3   1   0   0   0   0
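The coding scheme above can be sketched in code. This is a minimal illustration: the concrete integer codes for each phonetic category (volume, place, method, hardness) are assumptions taken from the rows of Table 1, since the paper does not list the full code tables explicitly.

```python
# Minimal sketch of the seven-dimensional speech-sound coding described above.
# The integer codes for each phonetic category are illustrative assumptions.

def encode(sounds):
    """sounds: list of (word_index, features) pairs in pronunciation order,
    where features is a dict of the phonetic attributes of one speech sound.
    Returns one 7-tuple (x1..x7) per sound; x1 is the ordinal number of the
    sound and x2 the ordinal number of its word."""
    points = []
    for x1, (word, feats) in enumerate(sounds, start=1):
        x3 = feats.get("labialization", 0)  # vowels only: 1 labialized, 2 not
        x4 = feats.get("volume", 0)         # consonants only
        x5 = feats.get("place", 0)          # consonants only
        x6 = feats.get("method", 0)         # consonants only
        x7 = feats.get("hardness", 0)       # consonants only
        points.append((x1, word, x3, x4, x5, x6, x7))
    return points

# First two sounds of "Yila Maryna malynu" (codes copied from Table 1):
pts = encode([(1, {"volume": 2, "place": 3, "method": 2, "hardness": 2}),  # y
              (1, {"labialization": 2})])                                   # i
# pts[0] == (1, 1, 0, 2, 3, 2, 2) and pts[1] == (2, 1, 2, 0, 0, 0, 0)
```

Because 𝑥1 is assigned from the pronunciation order, repeated sounds with identical phonetic features always map to distinct points.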
Let us determine the importance of the first coordinate in this vectorization. Some speech
sounds can be identical in terms of labialization, volume, hardness, softness, and place and
method of creation, and can be components of the same word. However, the sequence of
their pronunciation will always differ. In the example in Table 1, such speech sounds are
“y” and “a”.
Since the mapping is carried out into a seven-dimensional space, any direct visualization
exceeds human perception, so it is necessary to reduce the dimension. In Fig. 1, there are
two projections of the points corresponding to the tongue twister “Yila Maryna malynu”
into three-dimensional space with respect to different coordinates.
Figure 1: The tongue twister “Yila Maryna malynu” presented in the three-dimensional
spaces 𝑋1𝑋2𝑋3 [left] and 𝑋1𝑋3𝑋6 [right]
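The projections in Fig. 1 simply select three of the seven coordinates. A sketch of this coordinate projection (the point used below is the row for the first sound “y” in Table 1):

```python
def project(points, axes):
    """Project 7-dimensional points onto the given coordinate axes
    (1-based, matching the x1..x7 notation in the text)."""
    return [tuple(p[a - 1] for a in axes) for p in points]

# The point for the first sound "y" of "Yila Maryna malynu" (Table 1):
p = (1, 1, 0, 2, 3, 2, 2)
# Projections onto X1X2X3 and X1X3X6, as in Fig. 1:
# project([p], (1, 2, 3)) == [(1, 1, 0)]
# project([p], (1, 3, 6)) == [(1, 0, 2)]
```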
3.2. A dataset
For this research, the authors compiled a dataset of tongue twisters from open sources
that are used by speech therapists for the purpose of eliminating and preventing speech
defects in children's speech skills. The twisters have different quantities of speech sounds
and are oriented toward different types of speech problems. Fig. 2 shows a histogram of
the quantity of speech sounds per tongue twister.
Figure 2: A histogram of the tongue twisters dataset, where axis X is the quantity of speech
sounds in a tongue twister
There are 100 tongue twisters in the dataset. Tongue twisters that contain no more than
50 sounds make up the majority of the dataset; most likely, this is because long tongue
twisters are rarely used for therapeutic purposes. The most widely used tongue twisters
contain from 30 to 40 speech sounds.
As can be seen from the histogram, this distribution is far from normal. Therefore,
following general practice, it would be necessary to remove atypical tongue twisters from
the sample. However, given the small amount of data in the dataset, the authors refrain
from doing so.
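A histogram of the kind shown in Fig. 2 can be sketched by binning the per-twister sound counts. The sample lengths below are hypothetical, used only to illustrate the binning; the real dataset contains 100 twisters:

```python
from collections import Counter

def length_histogram(lengths, bin_width=10):
    """Bin tongue twisters by their number of speech sounds, as in Fig. 2.
    lengths: number of speech sounds per twister (one integer each)."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))

# Hypothetical lengths, chosen to mimic the 30-50 sound majority:
# length_histogram([34, 38, 41, 47, 52]) == {30: 2, 40: 2, 50: 1}
```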
3.3. Persistent homologies
For constructing and analyzing the structure of tongue twisters, concepts from topological
data analysis will be used. In particular, we will be interested in Betti numbers and their
geometric interpretation, see [11, 12].
The zero Betti number (𝛽0 = rank 𝐻0^{i,j}) is the number of connected components of the
space. The first Betti number (𝛽1 = rank 𝐻1^{i,j}) is the number of cycles in the space. The
second Betti number (𝛽2 = rank 𝐻2^{i,j}) is the number of 2-spheres in the space. For
calculating these invariants we used the l-th persistent homology 𝐻l^{i,j}, which is Im 𝑓𝑙^{𝑖,𝑗}
for 0 ≤ i ≤ j.
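As a minimal illustration of the zeroth Betti number, the sketch below counts connected components of a point cloud at a fixed scale using a union-find over the ε-neighborhood graph. This equals β₀ of the Vietoris–Rips complex at scale ε; in practice, full persistent homology across all scales is computed with dedicated libraries such as Ripser or GUDHI.

```python
import math

def betti0(points, eps):
    """Zeroth Betti number (number of connected components) of the
    eps-neighborhood graph on a point cloud, computed by union-find."""
    parent = list(range(len(points)))

    def find(a):
        # Find the root representative with path compression.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Merge every pair of points within distance eps into one component.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

# Two nearby points and one far away: two components at a small scale,
# one component once the scale is large enough.
cloud = [(0, 0), (0, 1), (10, 10)]
# betti0(cloud, 2) == 2; betti0(cloud, 20) == 1
```

Tracking how such counts change as ε grows is exactly what the persistence of 𝐻0 records.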