Using Word2vec Technique to Determine Semantic and
Morphologic Similarity in Embedded Words of the Ukrainian
Language
Larysa Savytska a, Nataliya Vnukova a, Iryna Bezugla a, Vasyl Pyvovarov b and M. Turgut Sübay c

a Simon Kuznets Kharkiv National University of Economics, Nauky av. 9a, Kharkiv, 61166, Ukraine
b Yaroslav Mudryi National Law University, Pushkinska str. 77, Kharkiv, 61024, Ukraine
c Piramit Danismanlik AS, Kadıköy, Istanbul, Turkey


                 Abstract
                 The study addresses the translation of words into vectors of real numbers (word embeddings), one of the most important topics in natural language processing. Word2vec, developed by Tomas Mikolov, is among the latest techniques for learning high-quality vectors. The majority of studies on clustering word vectors have been carried out for English. Dmytro Chaplynskyi has already computed and published vectors for the Ukrainian language using the LexVec, Word2vec and GloVe techniques, trained on fiction, newswire and ubercorpus texts, for the VESUM dictionary and other related NLP tools for the Ukrainian language. However, no research had examined vectors obtained with the Word2vec technique from a Ukrainian corpus built from a Wikipedia dump as the main source. The collection contains more than 261 million words, and the dictionary of unique words obtained from the corpus exceeds 709 thousand. Research using the Word2vec machine learning technique is of great practical importance for computerising many areas of linguistic analysis. The open-source Python programming language was used to obtain word vectors with the Word2vec technique and to calculate the cosine similarity of the vectors. Machine learning with Word2vec was performed in Python using "Gensim", a resource of open-source licensed software libraries, which was also used to calculate the cosine similarities of the obtained vectors. The clustering of the word vectors obtained from the Ukrainian corpus was examined with respect to two sub-branches of linguistics: semantics and morphology. Firstly, it was investigated how accurately the vectors obtained from the Ukrainian corpus represent the clusters the words belong to. Secondly, it was investigated how word vectors cluster and relate according to the morphological features of the suffixes of the Ukrainian language.

                 Keywords
                 word2vec, NLP, cosine similarity, semantic relations, morphological (linguistic) relations, word vectors, word embedding, Ukrainian language

COLINS-2021: 5th International Conference on Computational Linguistics and Intelligent Systems, April 22–23, 2021, Kharkiv, Ukraine
EMAIL: larisa-savickaya@hotmail.com (L. Savytska); vnn@hneu.net (N. Vnukova); iryna.bezugla@hneu.net (I. Bezugla); v.pyvovarov@ukr.net (V. Pyvovarov); m.turguts@hotmail.com (M. T. Sübay)
ORCID: 0000-0002-9158-6304 (L. Savytska); 0000-0002-1354-4838 (N. Vnukova); 0000-0002-6285-2060 (I. Bezugla); 0000-0001-9642-3611 (V. Pyvovarov); 0000-0002-2967-694X (M. T. Sübay)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction
    Since the first years of computer science and technology, each technological step has offered humanity the possibility of storing larger amounts of data in smaller volumes and at lower cost. The storage of large amounts of data, rapid analysis of data and data sharing have made computer science an important field of activity. Nowadays, computer technologies offer products that can be useful to all consumers of information services.
    The large amounts of data stored by computers bring with them the problem of finding data quickly. Techniques for solving this problem constitute an important industrial field and are continuously being developed. Manual analysis of data for content, meaning, sentiment, commercial value and similar needs becomes too slow as data grows, and processing big data under human control carries a high cost. The solution to these problems, in line with industrial needs, takes the form of software technologies that perform automatic data analysis without human assistance. Automatic analysis is constantly being improved to meet growing industrial needs: it enables information access, identifying people or objects in photographs, filtering advertising content from e-mail, sentiment analysis of correspondence, translation between languages and many similar tasks. Research using the Word2vec machine learning technique is of great practical importance for computerising many areas of linguistic analysis, such as:
     • identifying semantic similarity of words and phrases
     • automatic clustering of words according to the degree of their semantic closeness
     • automatic generation of thesauri and bilingual dictionaries
     • resolving lexical ambiguity
     • expanding queries through associative connections
     • defining the subject of a document
     • clustering documents for information retrieval
     • extracting knowledge from texts
     • constructing semantic maps of various subject areas
     • modelling periphrases
     • determining the tone of a statement
     • modelling word compatibility constraints.

2. Analysis of Publications
    The English mathematician Alan Turing asked the question "Can machines think?" This question opened up the idea of artificial intelligence and led to the discussion [1] of whether artificial intelligence technologies can learn like humans and communicate with people.
    Natural language processing began in 1950 with Alan Turing's publication "Computing Machinery and Intelligence" (which introduced what is now known as the "Turing test") [2, 3]. Turing is known as the father of theoretical computer science and artificial intelligence [1].
    In 1986, David E. Rumelhart introduced backpropagation of error to the world of artificial intelligence as a new learning technique in the study "Learning representations by back-propagating errors" [4]. The technique compares the vector known to be correct during the training phase with the vector whose accuracy is estimated; the error value is obtained from the difference between the two vectors. Its basic principle is to use the obtained error value to correct the dimensional weights of the vectors. The process involves an input vector, nonlinear hidden steps and an output vector. The backward propagation of errors has made a significant contribution to the development of artificial neural networks.
    In the backpropagation technique, input data is given to the network as a vector and an output vector is produced. An error amount (error signal, measured by a loss function) is calculated by comparing the output produced by the network with training data whose correct result is known. The dimensional weights of the artificial neural network are updated according to the obtained error, and these updates continue until the error drops to an acceptable level.
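    To illustrate the update loop described above, the following minimal sketch applies the rule to a single linear layer with a squared-error loss. It is a toy illustration of the principle only, not the multi-layer networks with nonlinear hidden steps studied in [4]; all names and values are our own.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 2))        # dimensional weights, random initial values
    x = np.array([0.5, -1.0, 2.0])     # input vector
    t = np.array([1.0, 0.0])           # training output known to be correct

    for step in range(100):
        y = x @ W                      # output vector produced by the network
        error = y - t                  # error signal: compare output with target
        W -= 0.1 * np.outer(x, error)  # correct the weights against the error

    print(y)                           # after the updates, y is close to t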
    In the early 1990s, the error propagation technique began to be used in natural language processing. Yoshua Bengio, one of those who made significant contributions to the development of natural language processing, studied the Recurrent Neural Network (RNN) [5, 6, 7]. The RNN technique is based on the error propagation technique. The long training periods that make statistics-based techniques unsuitable for practical use prompted researchers to study RNNs further [8]. With the adoption of backpropagation of error in natural language processing, learning techniques, deep learning and comparisons with shallow learning techniques became a broad area of research [9]. Tomas Mikolov stated in his studies on RNNs [10] that statistics-based models do not reflect the meaning relationships of words well, and that the vectors obtained with the RNN model his team developed can represent words more accurately.
    Today, backpropagation of error in natural language processing remains a current research area. The aim of this learning technique is to find the vector equivalents of words in a multi-dimensional coordinate system of real numbers. The vectors corresponding to words should represent the words correctly (high-quality vectors) in accordance with the structure of the language. An increase in vector quality increases the vectors' ability to reflect the many-sided relationships between words. Semantic results can be obtained by adding and subtracting high-quality vectors [11].

3. Using Word Embedding techniques in Research
    The clusters and sub-clusters among vectors obtained by machine learning parallel the syntactic, semantic and formal (structural) relationships of words. These relationships between words find wide application, especially in industrial areas such as search engines. In natural language processing, techniques for matching words with vectors (finding word vectors) are called word embeddings [12]. Vectors obtained with word embedding reflect the syntactic and semantic relationships of a word without the need for human intervention [13]. One reason for the development of word embedding techniques is that they shorten machine learning training time. Shorter training makes it practical to work with more vector dimensions and larger collections, and the ability to train on a large corpus with more vector dimensions is among the important factors in the correct representation of words by vectors.
    The three main factors that affect the correct representation of a word by the vectors obtained through machine learning in natural language processing are listed below.
    1. The size of the corpus used in training is an important factor influencing the results obtained [8]. Increasing the corpus size results in more error-correction operations on the weights of word vectors, and more corrections make it possible to obtain more accurate vector values. The most important disadvantage of a growing corpus is that it extends the training period; this problem can be overcome by developing new techniques [11, 14].
    2. The size of the trained vectors is another important factor affecting a vector's ability to represent the word correctly. Word vectors are multi-dimensional, and each dimension can be compared to a feature in supervised learning. It is observed that as the number of dimensions in unsupervised learning increases, similar to adding features in supervised learning, the word vectors represent the word more accurately. In statistics-based techniques older than Word2vec, the size of the vectors was kept between 50 and 100 in order to keep the training time short. Y. Bengio stated in his publications that growth in vector size naturally increases training times [7, 15]. Word2vec reduces the problem of long training periods, and with this reduction the number of vector dimensions could be increased; a vector size between 300 and 1000 is recommended for Word2vec [11].
    3. The number of neighbouring words used during training is one more important factor affecting the correct representation of the word by its vector. Increasing the number of neighbouring words causes more error-correction calculations on the vectors; this leads to more accurate weight values but also increases training time. A window of 5 to 10 neighbouring words is recommended for Word2vec.
    The size of the corpus, the size of the trained vectors and the number of neighbouring words are therefore important factors affecting vector quality. The table below, from the publication "Efficient Estimation of Word Representations in Vector Space" [11] by Google researcher Tomas Mikolov and his team, compares training times between the Word2vec techniques they developed and the Neural Network Language Model (NNLM) technique.

Table 1
Comparison of machine learning time between techniques (although the Word2vec algorithms use more vector dimensions than the older technique, they complete training in a shorter time)
    Model        Vector           Training     Accuracy [%]                      Training time
                 dimensionality   words        Semantic   Syntactic   Total     [days x CPU cores]
    NNLM         100              6B           34.2       64.5        50.8      14 x 180
    CBOW         1000             6B           57.3       68.9        63.7      2 x 140
    Skip-gram    1000             6B           66.1       65.1        65.6      2.5 x 125

    The Neural Network Language Model shown in Table 1 is an older technique than Word2vec. The continuous bag-of-words model (CBOW) and the skip-gram model, which predicts neighbouring words, are the two Word2vec algorithms. As the table shows, on a collection of 6 billion words the training process of the NNLM technique is completed in 14 days on 180 central processing units (CPU cores). CBOW, the first of the two Word2vec algorithms, completes machine learning in 2 days on 140 CPU cores, and skip-gram completes it in 2.5 days on 125 CPU cores.
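    In Gensim, the library used later in this study, the choice between these two algorithms is a single parameter. The sketch below is illustrative only: the toy corpus is our own, and the parameter names follow Gensim 4.x (older releases use size instead of vector_size).

    from gensim.models import Word2Vec

    # A toy corpus of tokenised sentences, purely for illustration.
    sentences = [["король", "правив", "країною"],
                 ["королева", "правила", "країною"]]

    # sg=0 selects the CBOW algorithm, sg=1 selects skip-gram.
    cbow = Word2Vec(sentences, sg=0, vector_size=300, window=10, min_count=1)
    skipgram = Word2Vec(sentences, sg=1, vector_size=300, window=10, min_count=1)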

4. Analysis of the Word2vec as a new technique in NLP
    The long training times of the earlier techniques used in natural language processing led researchers to seek ways to reduce training time and thereby obtain more accurate word vectors by processing large collections. Google researcher Tomas Mikolov and his team, who investigated this issue, announced in 2013 that they had developed a new technique called Word2vec [11]. The Word2vec technique is based on the error propagation technique, similar to RNN. As seen in Table 1, the two algorithms of the newly developed Word2vec technique give better results than the NNLM technique. T. Mikolov states that the vectors obtained with Word2vec cluster in accordance with the syntactic and semantic relations of words in the natural language.
    Examples of syntactic relationships between English words are the word pairs "great", "greater" or "easy", "easiest". T. Mikolov states that the vectors obtained with Word2vec cluster in a way that mirrors the syntactic relations in such word pairs.
    T. Mikolov also shows with examples that the vectors obtained with Word2vec cluster according to semantic relations. Among English word pairs, "Athens" and "Greece" are in the semantic relation of capital and country; similarly, "King" and "Queen" are semantically related as expressions of nobility.
    Semantic relationships between words cause word vectors to cluster according to the semantic relationships of the words they belong to, and the correct representation of a word makes it possible to obtain logical results. Semantic results can be obtained from the cosine similarities of a new vector produced by adding and subtracting vectors obtained with Word2vec. As an example of the semantic results that can be produced by arithmetic operations, the result of replacing the gender feature in the word "King", which expresses nobility, is shown below [11].
    ('King') - ('Man') + ('Woman') results in ('Queen')
    In another, similar example, the semantic relationship between countries and their capitals is given below.
    ('England') - ('London') + ('Athens') results in ('Greece')
    Ukrainian belongs to the East Slavic group of the Slavic branch of the Indo-European language family. English also belongs to the Indo-European family, but to its Germanic branch. Both Ukrainian and English are inflected languages; their essential distinction is the division into languages of analytical and synthetical structure. The analytical structure presupposes the wide use of function words, phonetic means and word order to express grammatical meanings; English is a language with analytical structure. The synthetical structure is characterised by the greater role of word forms, created with the help of flexions and word-forming suffixes and prefixes, in expressing grammatical meanings; Ukrainian belongs to the languages of synthetical structure. The examples above are given in English and are easy to reproduce, since corpora trained with Word2vec are mostly in English and a lot of work has been done on them.
    Tomas Mikolov said that "we should not expect only similar words to come close to each other, as there may be similarities in more than one way" [11]. In the example given in the same source, T. Mikolov noted that nouns can end with more than one suffix, and that when searching for similar words, words ending with similar affixes can be reached: "... with the expectation that not only will similar words tend to be close to each other, but that words can have multiple degrees of similarity [16]. It has been observed earlier in the context of inflectional languages – for example, nouns can have multiple word endings, and if we search for similar words in a subspace of the original vector space, it is possible to find words that have similar endings [4, 17]".
    The literature review shows that some research has been done on vectors obtained with the Word2vec technique from a Turkish corpus, considering the semantic relations and formal features of words [18]. Dmytro Chaplynskyi has already computed and published vectors for the Ukrainian language using the LexVec, Word2vec and GloVe techniques, within a project that created corpora from fiction, newswire and ubercorpus texts for the VESUM dictionary and other related NLP tools for the Ukrainian language [19]. There is no research on vectors obtained with the Word2vec technique from a Ukrainian corpus built from a Wikipedia dump as the main source. For this reason, the vectors obtained from the Ukrainian corpus prepared for our research (with Word2vec) were examined with respect to the semantic relations and formal features of words.
    Words used in natural language can take on different meanings by establishing contextual relationships with neighbouring words in sentences. The contextual relationship between words leads to multi-directional similarity relationships. In inflected languages such as Ukrainian, these similarities may also arise from the suffixes words take. Ukrainian is highly inflected: its affixal morphemes are mainly suffixes and sometimes prefixes, and the number of suffixes considerably exceeds the number of prefixes. As noted above, T. Mikolov stated that nouns can end with more than one suffix, and that when searching for similar words in Word2vec it is possible to reach words ending with similar suffixes [11].

5. Ukrainian Corpus trained with Word2vec: Tests and Results
   The open-source Python programming language was used to obtain word vectors with the Word2vec technique and to calculate the cosine similarity of the vectors. Machine learning with Word2vec was performed in Python using "Gensim", an open-source library [20, 21], which was also used to calculate the cosine similarities of the vectors. The continuous bag-of-words (CBOW) algorithm was used for machine learning, with a vector size of three hundred and a window size of ten. The learning process was run for five cycles (epochs).
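   A minimal sketch of this training setup is given below. The dump file name and the saved model path are assumptions for illustration; Gensim's WikiCorpus extracts article text from a Wikipedia dump [20], and the Word2vec parameters match those just described [21] (Gensim 4.x names).

    from gensim.corpora.wikicorpus import WikiCorpus
    from gensim.models import Word2Vec

    # Extract tokenised article texts from the Wikipedia dump
    # (the file name is an assumption; substitute the dump actually used).
    wiki = WikiCorpus("ukwiki-latest-pages-articles.xml.bz2", dictionary={})
    texts = list(wiki.get_texts())  # materialised so training can make several passes

    # CBOW (sg=0), 300-dimensional vectors, window of 10, five epochs.
    model = Word2Vec(texts, sg=0, vector_size=300, window=10, epochs=5)
    model.save("ukr_word2vec.model")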
   The cosine similarity values in the results are derived from the weights (dimension values) of the vector dimensions. Vector weights vary according to the size of the corpus used in training, the number of neighbouring words and the vector size. When a collection is trained twice with the same parameters, the vectors with the closest cosine similarity to the resulting vectors are expected to remain unchanged. However, since the initial weights of the vectors are initialised with random values, differences may occur in the result vectors and cosine similarity values. If the differences are large, it may be that the collection is not large enough. Increasing the number of cycles helps the vectors converge to more accurate values.
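   Cosine similarity itself is the cosine of the angle between two vectors, computed directly from their dimension weights; Gensim's similarity methods return the same quantity. A minimal sketch of the calculation:

    import numpy as np

    def cosine_similarity(u, v):
        # Dot product divided by the product of the vector lengths:
        # 1.0 for identical directions, values near 0 for unrelated words.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))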
   A Ukrainian corpus built from a Wikipedia dump as the main source was used in the research. These are texts with general subject content. The collection contains more than 261 million words, and the dictionary of unique words obtained from the corpus exceeds 709 thousand. Such big data makes it possible to conduct high-quality semantic and morphological analysis and arithmetic operations on word vectors.
    5.1. Semantic clustering of Ukrainian word vectors
    Word vectors obtained with Word2vec from the general-content Ukrainian corpus cluster and relate according to the semantic relations of Ukrainian words.
    The first example is the word "Яблуня". The five word vectors with the closest cosine similarity to the ('яблуня') vector are shown below.

   [('груша', 0.8305871486663818),
   ('ожина', 0.8061103224754333),
   ('суниця', 0.8029876947402954),
   ('черешня', 0.7985619306564331),
   ('шовковиця', 0.797329306602478)]
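   Lists such as the one above can be reproduced with Gensim's most_similar query, which returns (word, cosine similarity) pairs; the model file name below follows the training sketch in the previous subsection and is an assumption.

    from gensim.models import Word2Vec

    model = Word2Vec.load("ukr_word2vec.model")
    print(model.wv.most_similar("яблуня", topn=5))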

   The word "Яблуня" in the Academic Explanatory Dictionary of the Ukrainian language [22] is
defined as
   1. Noun, garden and forest fruit tree of the rose family with mostly spherical fruits.
   It is clearly seen that among the vectors obtained by training on the Ukrainian corpus, the vector closest in cosine similarity to the ('яблуня') vector is ('груша'). The word 'груша' in the Academic Explanatory Dictionary of the Ukrainian language is defined as
   1. Noun, garden and forest fruit tree with dark green dense leaves and fruits, mostly in the shape
   of a rounded cone.
   2. Noun, the fruit of this tree.
   3. Noun, an object that has the shape of the fruit of this tree [22].
   It can be clearly seen that the two words are in a semantic relationship.
   Among the vectors obtained by training on the Ukrainian collection, the second-closest vector by cosine similarity to the ('яблуня') vector is ('ожина'). The word 'ожина' in the Academic Explanatory Dictionary of the Ukrainian language is defined as
   1. Noun, a perennial shrub or semi-shrub prickly plant with arcuate branches and edible fruits.
   2. Noun, this plant’s berries are black and grey [22].
   It can be clearly seen that the two words are in a paradigmatic relationship.
   The other results obtained by training on the Ukrainian corpus are vectors belonging to the lexical paradigm of words denoting the names of fruit trees, related in meaning to the word "Яблуня".
   For the word "Картопля", the five word vectors with the closest cosine similarity to the vector ('картопля') are shown below.

   [('квасоля', 0.8419397473335266),
   ('помідори', 0.8089007139205933),
   ('баклажани', 0.788599967956543),
   ('морква', 0.7878426313400269),
   ('кабачки', 0.787842631340434)]

   The word "Картопля" in the Academic Explanatory Dictionary of the Ukrainian language is
defined as
   1. Noun, an annual herbaceous plant with edible tubers rich in starch.
   2. Noun, tubers of this plant, used as food and as animal feed [22].
   Among the vectors obtained by training on the Ukrainian corpus, the vector closest in cosine similarity to the ('картопля') vector is the ('квасоля') vector. The word "Квасоля" in the Academic Explanatory Dictionary of the Ukrainian language is defined as
   1. Noun, a garden herbaceous annual plant of the legume family, which has oval grains in the
   pods.
   2. Noun, the fruits (oval grains) of this plant, which are eaten [22].
   It is clearly seen that the two words are in a relationship of semantic affinity. The other vectors, such as ('помідори') and ('баклажани'), belong to the lexical paradigm of words denoting the names of vegetables, related in meaning to the word ('картопля').
   For the city name "Харків", the five word vectors with the closest cosine similarity to the vector ('харків') are shown below.

   [('київ', 0.6517883539199829),
   ('дніпро', 0.5908591747283936),
   ('полтава', 0.5591470003128052),
   ('донецьк', 0.5527917742729187),
   ('львів', 0.5408258438110352)]

   The word "Харків" in the Universal dictionary-encyclopaedia is defined as
   Regional centre of Ukraine, in the place of convergence for the Kharkiv, Lopan and Udy rivers
(Severskyi Donets Basin); 1.6 million people (second after Kyiv in terms of population in Ukraine)
[23].
   Among the vectors obtained by training on the Ukrainian corpus, the vector closest in cosine similarity to the ('харків') vector is the ('київ') vector. The word "Київ" in the Universal dictionary-encyclopaedia is defined as
   The capital of Ukraine, a city of state subordination, the centre of the region and of Kyiv-Sviatoshynskyi district; situated on both sides of the Dnieper River in its middle course, below the confluence of its left tributary the Desna; the rivers Lybid, Syrets, Vita (right tributaries of the Dnieper), Gorenka and Nivka (a tributary of the Irpin) also flow through Kyiv; 2.6 million people [23].
   It is clearly seen that the two words are in a relationship of semantic affinity. The other vectors, such as ('дніпро') and ('полтава'), belong to the lexical paradigm of words denoting other city names in Ukraine, related in meaning to the word "Харків".
   For the proper name "Микола", the five word vectors with the closest cosine similarity to the vector ('микола') are shown below.

   [('михайло', 0.8178503513336182),
   ('олександр', 0.7979997396469116),
   ('василь', 0.7977378368377686),
   ('федір', 0.7911434173583984),
   ('петро', 0.7876654267311096)]

    The word "Микола" is used as a neutral man’s name in Ukrainian. Among the vectors obtained by
training from the Ukrainian corpus the closest cosine-like ('микола') vector is the ('михайло') vector.
The word "Михайло" is also used as a neutral male name in Ukrainian language. When the vectors
that similar to the closest cosine to the vector ('михайло') are examined, the vectors belonging to
words / proper names representing male names in similar usage with the use of the word "Михайло"
are investigated. From the results, it is clearly seen that there is a semantic cluster related to the usage
area of the word "Микола".
    For the proper name "Леся", the five word vectors with the closest cosine similarity to the vector ('леся') are shown below.

   [('оксана', 0.6321463584899902),
   ('соломія', 0.5867637395858765),
   ('орися', 0.5651060342788696),
   ('михайлина', 0.5579890012741089),
   ('наталя', 0.5555435419082642)]

   The word "Леся" is used as a specific woman’s name in Ukrainian. Among the vectors obtained
by training from the Ukrainian corpus the closest cosine-like ('леся') vector is the ('оксана') vector.
The word "Оксана" is also used as a woman’s name in Ukrainian language. When the vectors that
similar to the closest cosine to the vector ('оксана') are examined, the vectors belonging to words /
proper names representing female names in similar usage with the use of the word "Оксана" are
investigated. From the results, it is clearly seen that there is a semantic cluster related to the usage
area of the word "Леся".
   According to the results, the words "Микола" and "Леся" differ in their gender characteristics but
they are in a semantic cluster related to proper names either neutral or specific.
   For the qualitative adjective "Білий", the five word vectors with the closest cosine similarity to the vector ('білий') are shown below.

     [('чорний', 0.758683443069458),
     ('блакитний', 0.6943730711936951),
     ('жовтий', 0.6647096872329712),
     ('синій', 0.6603621244430542),
     ('червоний', 0.6490797996520996)]

     The word "Білий" in the Academic Explanatory Dictionary of the Ukrainian language is defined
as
    1. Adjective. It has the colour of chalk, milk, snow; the opposite is black [22].
    Among the vectors obtained by training on the Ukrainian corpus, the vector closest in cosine similarity to the ('білий') vector is the ('чорний') vector. The word "Чорний" in the Academic Explanatory Dictionary of the Ukrainian language is defined as
    1. Adjective. Colour of soot, coal, and the darkest; opposite white [22].
    It is clearly seen that the two words are in a semantic relationship, even though they are antonyms. The other vectors, such as ('блакитний') and ('жовтий'), belong to the lexical paradigm of words denoting the names of colours, related in meaning to the word ('білий').
    The results obtained by training on the Ukrainian corpus thus show that the vectors cluster and relate according to the semantic relations of Ukrainian words.

5.2. Arithmetic operations of word vectors and semantic relationships
between words
    New vectors can be obtained as a result of adding and subtracting (arithmetic operations on) the word vectors obtained from the Ukrainian corpus.
    The first example is similar to the English example shown by T. Mikolov [11], in which cosine analogues of a new vector obtained from the English corpus are produced by adding and subtracting vectors:
    ('king') - ('man') + ('woman') = ('queen')
    The first five word vectors with the closest cosine similarity to the result vector of the ('король') - ('чоловік') + ('жінка') operation are shown below.
    Cosine similarity results for the ('король') - ('чоловік') + ('жінка') result vector:

     [('королева', 0.6145955324172974),
     ('принцеса', 0.46264657378196716),
     ('правителька', 0.45916682481765747),
     ('корона', 0.44287776947021484),
     ('королевою', 0.42545855045318604)]

    The result obtained from the Ukrainian corpus is similar to the result obtained from the English corpus. The word "Королева" is the Ukrainian equivalent of the word "Queen", and its vector was found to be the closest in cosine similarity to the result of the operation.
    The ('король') - ('чоловік') + ('жінка') operation replaces the gender feature in the word "Король", which expresses nobility. In terms of word meaning, the result of the operation is the word "Королева", so the word meaning is compatible with the result of adding and subtracting the vectors. The word "Королева" is defined as "the wife of the king or the woman who rules the kingdom" in the Academic Explanatory Dictionary of the Ukrainian language [22].
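    Analogy queries of this kind can be expressed in Gensim with most_similar, where the vectors of positive words are added and those of negative words subtracted before the nearest neighbours are ranked by cosine similarity. A sketch, assuming the model from the training sketch above:

    from gensim.models import Word2Vec

    model = Word2Vec.load("ukr_word2vec.model")  # assumed file name
    # король - чоловік + жінка
    print(model.wv.most_similar(positive=["король", "жінка"],
                                negative=["чоловік"], topn=5))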
    The first five word vectors with the closest cosine similarity to the result vector of the ('англія') - ('лондон') + ('київ') operation are shown below.
    Cosine similarity results for the ('англія') - ('лондон') + ('київ') result vector:

     [('україна', 0.5563449859619141),
     ('одеса', 0.4825827181339264),
     ('чернігів', 0.4495120048522949),
     ('дніпропетровськ', 0.4358903467655182),
     ('харків', 0.40171462297439575)]

     The ('англія') - ('лондон') + ('київ') operation expresses the relationship between countries and their cities (or capitals). The result with the word "Україна" shows that the word meanings are compatible with the result of adding and subtracting vectors.
     The first five word vectors with the closest cosine similarity to the result vector of the ('банк') - ('золото') + ('кредит') operation are shown below.
     Cosine similarity results for the ('банк') - ('золото') + ('кредит') result vector:

     [('капітал', 0.5785767436027527),
     ('кредитний', 0.5400320291519165),
     ('інвестор', 0.5385503172874451),
     ('банкінг', 0.5157139301300049),
     ('позичальник', 0.5080622434616089)]

     The word "Банк" in the Great explanatory dictionary of the modern Ukrainian language is defined
as
    1. A credit and financial institution that concentrates funds and investments, provides loans,
    makes cash settlements between enterprises or individuals, and regulates money circulation in the
    country, including the issuance of new money [24].
    The word "Золото" in the Great explanatory dictionary of the modern Ukrainian language is
defined as
    1. The simple substance of the chemical element Aurum: a yellow, soft, malleable metal.
    2. Golden things; expensive gold-woven clothes, etc.
    3. Gold coins, money, etc.
    4. About something very valuable, beautiful or about someone worthy of respect.
    5. About gold medal (gold medals) for a victory, the first place in sports competitions, on
    competition, etc. [24].
    The word "Кредит" in the Great explanatory dictionary of the modern Ukrainian language is
defined as
    1. Lending of material values, money; loan.
    2. Budget amounts in the outlay, within which the costs of something are allowed [24].
    The semantic result obtained from the ('банк') - ('золото') + ('кредит') operation reflects the transformation of investments and loans into capital through profits and surpluses. The result with the word "Капітал" shows that the word meanings are compatible with the result of adding and subtracting vectors. The other vectors, such as 'кредитний', 'інвестор', 'банкінг' and 'позичальник', are in a relationship of semantic affinity.
    The first five word vectors with the closest cosine similarity to the result vector of the ('спорт') - ('гімнастика') + ('баскетбол') operation are shown below.
    Cosine similarity results for the ('спорт') - ('гімнастика') + ('баскетбол') result vector:

     [('бейсбол', 0.6380308866500854),
     ('гольф', 0.5946487784385681),
     ('футбол', 0.5910741090774536),
     ('крикет', 0.5801275372505188),
     ('регбі', 0.5448272824287415)]
     The word "Спорт" in the Academic Explanatory Dictionary of the Ukrainian language is defined
as
   1. Physical exercises (gymnastics, athletics, sports, tourism, etc.), which aim to develop and
   strengthen the body or mind [22].
   The ('спорт') - ('гімнастика') + ('баскетбол') operation is the transaction of exchange of two
sports branches in the vector operation. When the results are evaluated, the result vectors are
compatible in terms of meaning relations of the words they belong to.
    The clusters among the vectors obtained from the Ukrainian corpus, exemplified above, were examined with respect to the semantic relationships between the words they belong to. It is clearly seen that semantic relations between Ukrainian words produce clusters among the vectors, and that the semantic results obtained by addition and subtraction of vectors from the English corpus can also be obtained from the Ukrainian corpus.

5.3. Formal clustering of Ukrainian word vectors
   Ukrainian is a language of inflexion. Affixal morphemes in Ukrainian are mainly suffixes and sometimes prefixes; indeed, the number of suffixes considerably exceeds the number of prefixes. A general feature of the Ukrainian language is that word roots are kept constant while inflections and derivational affixes, which have various functions, are added to the roots. By adding different suffixes to word roots, new words are derived, and the vocabulary of the language is formed in this way.
   When searching for similar words using word vectors, it is seen that words ending with similar suffixes can be reached [11]. The word vectors obtained from the Ukrainian corpus (with Word2vec) cluster and relate according to Ukrainian-specific suffixes.
   The first word examined is "Ходити". The five word vectors with the closest cosine similarity to the vector ('ходити') are shown below.
   The cosine similarity results for the vector ('ходити'):

     [('бігати', 0.7297786474227905),
     ('їздити', 0.7247925996780396),
     ('лазити', 0.7013007402420044),
     ('сидіти', 0.7005398273468018),
     ('гуляти', 0.697922945022583)]

    The first five words clustered by cosine similarity are vectors belonging to verbs in the infinitive form. The clustering of the word vectors is related to a formal feature, the infinitive suffix "-ти".
    The word "Ходив" is derived by adding the past tense masculine singular affix "-в" to the root "ход". The five word vectors with the closest cosine similarity to the vector ('ходив') are shown below.
    The cosine similarity results for the vector ('ходив'):

     [('бігав', 0.6834566593170166),
     ('водив', 0.678192138671875),
     ('вирушав', 0.6492938995361328),
     ('приїжджав', 0.6308779716491699),
     ('гуляв', 0.6269496083259583)]

    The first five words, "бігав", "водив", "вирушав", "приїжджав" and "гуляв", are vectors belonging to verbs in the past tense masculine singular, derived by adding the past tense masculine singular affix to verb roots. The clustering of the word vectors is related to a formal feature, the past tense masculine singular affix "-в".
    The word "Прачка" is derived by adding the feminine suffix "-к" to the root "прач"; the occupation name is derived from the verb describing the activity. The five word vectors with the closest cosine similarity to the vector ('прачка') are shown below.
    The cosine similarity results for the vector ('прачка'):
   [('кухарка', 0.6516879796981812),
   ('економка', 0.6106237173080444),
   ('хазяйка', 0.6105634570121765),
   ('покоївка', 0.5945062637329102),
   ('нянька', 0.581095814704895)]

    The first five words clustered by cosine similarity are vectors belonging to nouns denoting occupations. The clustering of the word vectors is related to a formal feature, the feminine noun suffix "-к".
    The word "Яблуня" was discussed in the analysis of semantic relationships between vectors. For the word "Яблуневий", as in the sentence "Мені подобається яблуневий пиріг", the five word vectors with the closest cosine similarity to the vector ('яблуневий') are shown below.
    The cosine similarity results for the vector ('яблуневий'):

   [('вишневий', 0.6773509979248047),
   ('квітучий', 0.6477996110916138),
   ('райський', 0.6233205199241638),
   ('фруктовий', 0.6139903664588928),
   ('виноградний', 0.6060866117477417)]

   The word "Яблуневий" in the sentence "Мені подобається яблуневий пиріг" is derived by
adding the suffix "-ев" which makes the word an adjective with the word meaning apple tree. Among
the vectors obtained by training from the Ukrainian corpus the closest cosine-like vector is
'вишневий'. The clustering of word vectors is related to the formal feature adjective suffix "-ев".
   The clusters between the vectors obtained from the Ukrainian corpus were examined considering
the formal relationship between the words. It is proved that the word vectors obtained from Ukrainian
corpus are clustered and related according to the Ukrainian-specific suffixes.

5.4. Arithmetic operations of word vectors and formal relations between words
    New vectors can also be obtained by adding and subtracting (arithmetic operations on) the word vectors obtained from the Ukrainian corpus and examining the formal clustering.
    The first five word vectors with the closest cosine similarity to the result vector of the ('квіти') - ('квітка') + ('яблуко') operation are shown below.
    Cosine similarity results for the ('квіти') - ('квітка') + ('яблуко') result vector:

   [('яблука', 0.5736010074615479),
   ('руно', 0.4630397856235504),
   ('перо', 0.4570971727371216),
   ('теля', 0.44505059719085693),
   ('курча', 0.4449284076690674)]

    According to the results of the ('квіти') - ('квітка') + ('яблуко') operation, the first vector, 'яблука', clustered by cosine similarity, carries the formal feature of the plural form of nouns. The clustering of the word vectors 'руно', 'перо', 'теля', 'курча' is related to the formal feature of neuter gender.
    The first five word vectors with the closest cosine similarity to the result vector of the ('олівці') - ('олівець') + ('ручка') operation are shown below.
    Cosine similarity results for the ('олівці') - ('олівець') + ('ручка') result vector:

   [('ручки', 0.6751487255096436),
   ('фольга', 0.6358448266983032),
   ('пластмаси', 0.6101338863372803),
   ('накладки', 0.6093952655792236),
   ('упаковка', 0.5968321561813354)]
    According to the results of the ('олівці') - ('олівець') + ('ручка') operation, the vectors 'ручки', 'пластмаси', 'накладки', clustered by cosine similarity, carry the formal feature of the plural form of nouns. The clustering of the word vectors 'фольга', 'упаковка' is related to the formal features of noun, feminine gender.
    The first five word vectors with the closest cosine similarity to the result vector of the ('яблуневий') - ('яблуня') + ('вишня') operation are shown below.
    Cosine similarity results for the ('яблуневий') - ('яблуня') + ('вишня') result vector:

   [('вишневий', 0.5016539096832275),
   ('гетсиманський', 0.48334306478500366),
   ('веселий', 0.45027756690979004),
   ('грушка', 0.4389912188053131),
   ('гефсиманський', 0.43634486198425293)]

    According to the results of the ('яблуневий') - ('яблуня') + ('вишня') operation, the vectors 'вишневий', 'гетсиманський', 'веселий', 'гефсиманський', clustered by cosine similarity, carry the formal features of adjective, masculine gender.
    The clusters among the vectors obtained from the Ukrainian corpus were examined with respect to the formal relationships between the words. This shows that the results obtained by addition and subtraction of vectors cluster and relate according to Ukrainian-specific suffixes.

6. Conclusions
    Research using the Word2vec machine learning technique is of great practical importance for computerising many areas of linguistic analysis. The results obtained on the clustering of word vectors show an affinity in the construction of words in terms of both semantic and morphological similarity, which indicates a high structural level of construction of the Ukrainian language.
    Clusters of vectors obtained from the Ukrainian corpus can reflect semantic affinity between the words they belong to, formal relations, or both, and clusters can break down into sub-clusters.
    In the cosine similarities of the vector belonging to the word "Яблуня", the word "Яблуня" is clustered with vectors belonging to the lexical paradigm of words denoting the names of fruit trees.
    In the cosine similarities of the vector belonging to the word "Картопля", the word "Картопля" is clustered with vectors belonging to the lexical paradigm of words denoting the names of vegetables.
    In the cosine similarities of the vector belonging to the word "Харків", the words "Дніпро" and "Полтава" denote other city names in Ukraine and belong to the lexical paradigm of words related in meaning to the word "Харків".
    In the cosine similarities of the vector belonging to the word "Микола", this neutral male name used in Ukrainian is clustered together with other neutral male names.
    In the cosine similarities of the vector belonging to the word "Леся", this specific female name used in Ukrainian is clustered together with other specific female names.
    In the clustering of proper names, the separate clustering of male and female names is an example of sub-clustering.
    In the cosine similarities of the vector belonging to the word "Білий", the first vector, 'чорний', is in a semantic relationship of antonymy. The other vectors, such as 'блакитний' and 'жовтий', belong to the lexical paradigm of words denoting the names of colours, related in meaning to the word 'білий'.
    New vectors can be obtained by adding and subtracting (arithmetic operations on) the word vectors obtained from the Ukrainian corpus. The cosine similarities of the vectors obtained by these operations were examined for compatibility with the meaning of the operation. It is shown that the semantic results that can be obtained by addition and subtraction of vectors from the English corpus can also be obtained from the Ukrainian corpus.
    The vector obtained as a result of the ('король') - ('чоловік') + ('жінка') operation, ('королева'), is the first among the vectors closest in cosine similarity. The operation and its result vectors are compatible with the semantic relationships of the words they belong to.
    The vector obtained as a result of the ('англія') - ('лондон') + ('київ') operation, ('україна'), is the first among the vectors closest in cosine similarity. The operation expresses the relationship between countries and their cities (or capitals), and the operation and result vectors are compatible with the semantic relationships of the words they belong to.
    The vector obtained as a result of the ('банк') - ('золото') + ('кредит') operation, 'капітал', is the first among the vectors closest in cosine similarity. It reflects the transformation of investments and loans into capital through profits and surpluses. The result with the word "Капітал" shows that the word meanings are compatible with the result of adding and subtracting vectors, and the other vectors, such as 'кредитний', 'інвестор', 'банкінг' and 'позичальник', are in a relationship of semantic affinity.
    The vector obtained as a result of the ('спорт') - ('гімнастика') + ('баскетбол') operation is 'бейсбол'. The operation exchanges two sports branches in the vector operation, and the result vectors are compatible with the meaning relations of the words they belong to.
    Considering the morphological properties of words, the vectors can cluster according to the suffixes the words take; clustering by suffix can also include semantic relationships of words.
    In the cosine similarities of the vector belonging to the word "Ходити", the clustering of word vectors is related to a formal feature, the infinitive suffix "-ти".
    In the cosine similarities of the vector belonging to the word "Ходив", the clustering of word vectors is related to a formal feature, the past tense masculine singular affix "-в".
    In the cosine similarities of the vector belonging to the word "Прачка", the clustering of word vectors is related to a formal feature, the feminine noun suffix "-к" that denotes occupations.
    In the cosine similarities of the vector belonging to the word "Яблуневий", the clustering of word vectors is related to a formal feature, the adjective suffix "-ев".
    New vectors can also be obtained by adding and subtracting (arithmetic operations on) the word vectors obtained from the Ukrainian corpus and examining the formal clustering.
    According to the results of the ('квіти') - ('квітка') + ('яблуко') operation, the first vector, 'яблука', clustered by cosine similarity, carries the formal feature of the plural form of nouns, while the clustering of the word vectors 'руно', 'перо', 'теля', 'курча' is related to the formal feature of neuter gender.
    According to the results of the ('олівці') - ('олівець') + ('ручка') operation, the vectors 'ручки', 'пластмаси', 'накладки', clustered by cosine similarity, carry the formal feature of the plural form of nouns, while the clustering of the word vectors 'фольга', 'упаковка' is related to the formal features of noun, feminine gender.
    According to the results of the ('яблуневий') - ('яблуня') + ('вишня') operation, the vectors 'вишневий', 'гетсиманський', 'веселий', 'гефсиманський', clustered by cosine similarity, carry the formal features of adjective, masculine gender.
    A more detailed analysis of computerising such areas of linguistics as constructing semantic maps of various subject areas and expanding queries through associative connections will be the subject of our further research.

7. References
[1] Britannica, The Editors of Encyclopaedia, Turing test, Artificial intelligence, 2020. URL:
    https://www.britannica.com/technology/Turing-test
[2] A. M. Turing, Computing Machinery and Intelligence, Mind (1950) 433–460.
    doi:10.1093/mind/lix.236.433
[3] T. Mendès France, Turing et son test : une alchimie contemporaine ? Notes sur les critiques des
    scientifiques des années 90 [Turing and his test: a contemporary alchemy? Notes on scientists'
    critiques of the 1990s], Quaderni (1996) 41–46. doi:10.3406/quad.1996.1953
[4] D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning representations by back-propagating
    errors, Nature (1986) 533–536. doi:10.1038/323533a0
[5] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is
     difficult, IEEE Transactions on Neural Networks (1994) 157–166. doi:10.1109/72.279181
[6] S. El Hihi, Y. Bengio, Hierarchical recurrent neural networks for long-term dependencies, in:
     Proceedings of the 8th International Conference on Neural Information Processing Systems,
     NIPS’95, MIT Press, Cambridge, MA, USA, 1995, pp. 493–499. doi:10.5555/2998828.2998898
[7] Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin, A neural probabilistic language model, Journal
     of Machine Learning Research (2003) 1137–1155. doi:10.1162/153244303322533223
[8] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, S. Khudanpur, Recurrent neural network based
     language model, in: Proceedings of Interspeech-2010, International Speech Communication
     Association, Makuhari, Chiba, JP, 2010, pp. 1045–1048.
[9] Y. Bengio, Y. Lecun, Scaling learning algorithms towards AI, in: L. Bottou, O. Chapelle, D.
     DeCoste, J. Weston (Eds), Large-scale kernel machines, MIT Press, Cambridge, Mass, 2007.
     doi:10.7551/mitpress/7496.003.0016
[10] T. Mikolov, S. Kombrink, L. Burget, J. Černocký, S. Khudanpur, Extensions of Recurrent Neural
     Network Language Model, in: Proceedings of the 2011 IEEE International Conference on
     Acoustics, Speech, and Signal Processing, ICASSP, IEEE Signal Processing Society, Praha, CZ,
     2011, pp. 5528–5531. doi:10.1109/ICASSP.2011.5947611
[11] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in
     Vector Space, in: Proceedings of Workshop at ICLR 2013, Scottsdale, Arizona, USA, 2013.
     arXiv:1301.3781v3
[12] R. Lebret, R. Collobert, Word Embeddings through Hellinger PCA, in: Proceedings of the 14th
     Conference of the European Chapter of the Association for Computational Linguistics,
     Association for Computational Linguistics, Gothenburg, Sweden, 2014, pp. 482–490.
     doi:10.3115/v1/E14-1051
[13] Y. Chen, B. Perozzi, R. Al-Rfou, S. Skiena, The Expressive Power of Word Embeddings, in:
     Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta,
     Georgia, USA, 2013. arXiv:1301.3226
[14] T. Brants, A. Popat, P. Xu, F. Och, J. Dean, Large Language Models in Machine Translation, in:
     Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing
     and Computational Natural Language Learning, EMNLP-CoNLL, Association for Computational
     Linguistics, 2007, pp. 858–867.
[15] Y. Bengio, J.-S. Senecal, Quick Training of Probabilistic Neural Nets by Importance Sampling,
     in: Proceedings of AISTATS 2003. Society for Artificial Intelligence and Statistics, Florida,
     USA, 2003.
[16] J. Weizenbaum, ELIZA - a computer program for the study of natural language communication
     between man and machine, Communications of the ACM, (1966) 36–45.
     doi:10.1145/365153.365168
[17] T. Mikolov, Q.V. Le, I. Sutskever, Exploiting Similarities among Languages for Machine
     Translation; Computing Research Repository (CoRR), 2013. arXiv:1309.4168
[18] M. T. Sübay, Türkçe kelime vektörlerinde görülen anlamsal ve biçimsel yakınlaşmalar [The
     semantic and morphologic similarity in Turkish word embeddings]. Maltepe Üniversitesi, Sosyal
     Bilimler Enstitüsü, İstanbul, 2019. URL: https://hdl.handle.net/20.500.12415/2733
[19] A. Rysin, V. Starko, D. Chaplynskyi, Slovnyk VESUM ta inshi poviazani zasoby NLP dlia
     ukrainskoi movy [VESUM dictionary and other related NLP tools for the Ukrainian language],
     2007. URL: https://r2u.org.ua/articles/vesum
[20] Gensim: topic modelling for humans. Corpus from a Wikipedia dump. URL:
     https://radimrehurek.com/gensim/corpora/wikicorpus.html
[21] Gensim: topic modelling for humans. Word2vec embeddings. URL:
     https://radimrehurek.com/gensim/models/word2vec.html
[22] Slovnyk ukrainskoi movy. Akademichnyi tlumachnyi slovnyk [Academic Explanatory
     Dictionary of the Ukrainian language], 2018. URL: http://sum.in.ua/
[23] USE (Universalnyi slovnyk-entsyklopediia) [UDE (Universal Dictionary-Encyclopedia)],
     Slovopedia, 2007. URL: http://slovopedia.org.ua/29/53392-0.html
[24] Velykyi tlumachnyi slovnyk (VTS) suchasnoi ukrainskoi movy [Great explanatory dictionary of
     the modern Ukrainian language], Slovopedia, 2007. URL:
     http://slovopedia.org.ua/93/53393/828300.html, http://slovopedia.org.ua/93/53399/882529.html