=Paper= {{Paper |id=Vol-2718/paper14 |storemode=property |title=Tree Structure of Slovak Sentences |pdfUrl=https://ceur-ws.org/Vol-2718/paper14.pdf |volume=Vol-2718 |authors=Michaela Linková,Stanislav Krajči |dblpUrl=https://dblp.org/rec/conf/itat/LinkovaK20 }} ==Tree Structure of Slovak Sentences== https://ceur-ws.org/Vol-2718/paper14.pdf
                              Tree structure of Slovak sentences

                                     Michaela Linková1 and Stanislav Krajči

                 Institute of Computer Science, Pavol Jozef Šafárik University in Košice, Slovakia,
                                    michaela.linkova@student.upjs.sk

Abstract: In this work-in-progress paper, we propose an algorithm for creating a tree structure of a Slovak sentence. The tree structure of a sentence represents the relationships and dependencies between the words in the sentence. The root of the tree is the predicate. Finding the right sentence structure helps to understand its meaning better. In the Slovak language, words have different forms, and there are various ways to compose sentences. We found that the algorithm works properly for simple sentences with one predicate and one subject, and for sentences connected by one conjunction.


1    Introduction and related works

Natural language processing of the Slovak language is a topical and fruitful area of research, and among the most challenging to arise in recent years. The Slovak language belongs to the group of inflected languages and has complex rules for word inflection, as there are many possible word forms and contexts in which they are used. One part of natural language processing is understanding the structure of sentences. This work-in-progress paper proposes an algorithm for creating a tree structure of a Slovak sentence. The algorithm is not based on statistical data from a corpus, but takes raw data from Tvaroslovník, a database of all forms of Slovak words. The tree structure of a sentence can represent the relationships and dependencies between the words in the sentence. The root of the tree is the predicate. The tree structure for the Slovak sentence "Lucia číta veľmi peknú knihu." 1 is shown in Figure 1.

     Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
     1 Lucy is reading a very beautiful book.

Despite these challenges, several tools, dictionaries and conferences exist in this area of research. For example, the paper Online Natural Language Processing of the Slovak Language presents a web site with NLP tools for lemmatization, correction, finding the part of speech of words in a sentence, and others [1]. The paper Morphological Analysis of the Slovak Language introduces a statistical algorithm that segments words by identifying a suffix. This ability to identify the suffix helps to classify even words unseen in the training corpus [2]. It is necessary to have more information about words to work with them. The Ľudovít Štúr Institute of Linguistics offers a wide selection of dictionaries. These include a Dictionary of the Slovak Language, a dictionary of Slovak spelling rules, a Dictionary of Foreign Words, a dictionary of synonyms and many more [3]. It also provides the Slovak National Corpus, an electronic database mainly containing Slovak texts from 1955 onwards, covering different styles, genres, thematic areas, regions and more. Every two years, the institute organizes the SLOVKO conference on natural language processing [4]. In 2017, the Slovak Dependency Treebank in Universal Dependencies was created from the Slovak National Corpus [5]. Tvaroslovník, a database of Slovak words and their forms, was created at Pavol Jozef Šafárik University in Košice [6], [7]. The master's thesis Syntaktická analýza slovenskej vety pomocou Tvaroslovníka deals with the creation of an algorithm for finding the structure of a sentence [8]. Statistical and machine learning approaches have been developed in recent years, for example a two-stage multilingual dependency parser, which was evaluated on 13 diverse languages including Czech [9], or a neural network classifier for use in a greedy, transition-based dependency
        Figure 1: Example of the tree structure of the sentence "Lucia číta veľmi peknú knihu." 2


parser [10].

Slovak and Czech have similar grammatical rules and sentence structure. Therefore, tools and algorithms for natural language processing of Czech can also inspire algorithms for Slovak. Three main institutions in the Czech Republic deal with natural language processing: one at Masaryk University in Brno and two at Charles University in Prague. Masaryk University created tools for adding diacritics to texts, topic detection and named entity recognition, the morphological analyzer "Majka" and others [11]. The Institute of Formal and Applied Linguistics at Charles University developed tools for annotation, tagging and correction, as well as a morphological analyzer and others [12]. This institute also develops the trainable pipeline "UDPipe", which performs sentence segmentation, tokenization, lemmatization and dependency parsing [13]. The third is the Institute of Theoretical and Computational Linguistics at Charles University. This institute develops computational tools for automatic language processing, for example syntactic annotation of Czech corpora or a grammar-based treebank of Czech [14].


2   Slovak syntax

Grammatical rules define the structure of a sentence. A sentence member is the basic unit of a sentence: a part of the sentence that stands in some relation to the other parts. The main sentence members are the subject and the predicate.

The subject is the part of the sentence that describes who or what is doing something. It can be a noun, adjective, pronoun or numeral. The subject should be in the first case (nominative). In Slovak sentences, the subject can be expressed or unexpressed. An unexpressed subject means that the subject does not appear in the sentence. For example, the sentence "Nakupujeme", which means "We are shopping", contains only a verb and no subject. Because the verb "Nakupujeme" is in the first person plural, it is clear in Slovak that the subject is "we". The predicate expresses an action or state of the subject. It can be a verb, or an auxiliary verb plus a noun, adjective or numeral. In Figure 1, the subject is "Lucia" and the predicate is "číta".

The other sentence members are the object, attribute and adverbial. The object specifies the predicate and names what the action is directed at. It can be a noun or pronoun and it should not be in the first case. The adverbial gives more information about time, place, manner or cause. It can be an adverb or a noun. The attribute modifies a noun, such as the subject or the object, and is an adjective, pronoun or numeral. There are two types of
attribute. One type has to have the same grammatical categories as the noun it modifies, and the other type differs from it in at least one grammatical category. In Figure 1, the object is "knihu" and the attribute is "peknú". In the sentence "Deti prišli večer." 3, the adverbial is "večer". Every sentence member except the predicate can occur in a sentence more than once. Conjunctions or commas connect sentence members of the same kind. Sentence members create relations between them. In a relation, the word that depends on the other word is called the dependent, and the other word is called the superior.

The Slovak language distinguishes three types of sentence: simple, compound and fundament sentences. A simple sentence has only one predicate. A compound sentence has two or more predicates. The parts of a compound sentence are connected by conjunctions or commas. The third type has no subject, only a verb or a noun. The verb in these sentences usually describes a general action, for example "Prší" 4 [15].

     3 Children came in the evening.
     4 It is raining.


3     Finding structure of sentence

Tvaroslovník is a database of all forms of Slovak words. It was created at Pavol Jozef Šafárik University. The database contains many Slovak words, their forms and other information about them. Every row contains the form of a word, its part of speech and its grammatical categories. The data in Tvaroslovník were collected from the dictionary of the Slovak language. All data are stored in one table with the following columns:

    • idWord: unique identification number of the word,
    • idForm: unique identification number of the word's form,
    • form: a form of the word,
    • part-of-speech,
    • categories: grammatical categories, which differ for every part of speech.

Table 1 shows an example of the records for the word "kniha" 5.

     5 book

The algorithm for finding the tree structure of a sentence takes Slovak text as input. In the first step, the input is split into sentences using the full stop as a separator. The sentences are put into a list, and the algorithm then iterates over this list. It finds all forms and characteristics of each word in a sentence by using Tvaroslovník. In the next step, the algorithm finds out how many predicates the sentence has. If there is more than one predicate, the algorithm separates the sentence into smaller parts. The separation is done at a conjunction or comma. The next steps are the same as for sentences with one predicate, but the algorithm uses these smaller parts instead of the whole sentence. After identifying the type of sentence, the algorithm checks how many sentence members of the same kind are in the sentence. If there is more than one, the algorithm takes this part of the sentence and creates a relation between each member and the comma or conjunction that connects these members. These relations are added to the list of possible relations of the sentence. After that, the algorithm takes pairs (tuples) of words that stand next to each other and tries to choose a possible relation between them. The algorithm defines 29 types of relations between words. Besides the required parts of speech and grammatical categories, every relation has a priority determined by empirical experience. All types of relations and their priorities are presented in Table 2.

For each pair, the algorithm iterates over all forms of the pair's words and tries to find suitable word forms according to the possible types of relations. After the type of relation is chosen for each pair, the relations are put into the list of possible relations and sorted by priority from the highest to the lowest. If there is more than one possible relation, the algorithm puts all of them into the list of possible relations. Next, the algorithm selects the relation with the highest
 idWord       idForm        form            part-of-speech   categories
    1            0          kniha                noun        gender: feminine; number: singular; case: nominative
    1            1          knihy                noun        gender: feminine; number: singular; case: genitive
    1            2          knihe                noun        gender: feminine; number: singular; case: dative
    1            3          knihu                noun        gender: feminine; number: singular; case: accusative
    1            4          knihe                noun        gender: feminine; number: singular; case: locative
    1            5          knihou               noun        gender: feminine; number: singular; case: instrumental
    1            6          knihy                noun        gender: feminine; number: plural; case: nominative
    1            7          kníh                 noun        gender: feminine; number: plural; case: genitive
    1            8          knihám               noun        gender: feminine; number: plural; case: dative
    1            9          knihy                noun        gender: feminine; number: plural; case: accusative
    1           10          knihách              noun        gender: feminine; number: plural; case: locative
    1           11          knihami              noun        gender: feminine; number: plural; case: instrumental

                                                 Table 1: Records in Tvaroslovník for the word "kniha"
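The rows of Table 1 can be pictured as simple Java values. The following is a minimal sketch and not the authors' implementation: the names `WordForm`, `FormIndex`, `lookup` and `sample` are hypothetical, and the in-memory map only stands in for the real database. It shows why a lookup by surface form may return several rows, all of which the algorithm has to consider.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One row of the table: a single form of a word with its grammatical categories. */
record WordForm(int idWord, int idForm, String form, String partOfSpeech, Map<String, String> categories) {}

/** Hypothetical in-memory index from a surface form to every matching row. */
final class FormIndex {
    private final Map<String, List<WordForm>> byForm = new LinkedHashMap<>();

    void add(WordForm row) {
        byForm.computeIfAbsent(row.form(), key -> new ArrayList<>()).add(row);
    }

    /** Returns all rows whose form equals the given word, or an empty list. */
    List<WordForm> lookup(String word) {
        return byForm.getOrDefault(word, List.of());
    }

    /** Builds an index holding four of the rows shown in Table 1. */
    static FormIndex sample() {
        FormIndex index = new FormIndex();
        index.add(noun(1, 1, "knihy", "singular", "genitive"));
        index.add(noun(1, 3, "knihu", "singular", "accusative"));
        index.add(noun(1, 6, "knihy", "plural", "nominative"));
        index.add(noun(1, 9, "knihy", "plural", "accusative"));
        return index;
    }

    private static WordForm noun(int idWord, int idForm, String form, String number, String grammaticalCase) {
        return new WordForm(idWord, idForm, form, "noun",
                Map.of("gender", "feminine", "number", number, "case", grammaticalCase));
    }

    public static void main(String[] args) {
        // "knihy" is ambiguous: it matches three rows, so all of them must be considered.
        for (WordForm row : sample().lookup("knihy")) {
            System.out.println(row.idForm() + " " + row.categories().get("number")
                    + " " + row.categories().get("case"));
        }
    }
}
```

Keeping a list per surface form, rather than a single row, mirrors the ambiguity visible in Table 1, where "knihy" is the genitive singular as well as the nominative and accusative plural.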


priority in the list. The selected relation is added to the list of final relations. This list helps to build the tree. After that, the selected pair (tuple) is removed from the list of possible relations, and the dependent word is removed from the sentence. Every relation that has the same dependent word as the selected relation but a different superior word is removed from the list of possible relations. The words next to the removed word then become candidates for a new relation, so the algorithm checks whether they can form one. If some type of relation applies, the new relation is added to the list of possible relations, and the list is sorted by priority again. If there are two relations with the same dependent and superior word forms but a different type of relation than the selected one, the algorithm creates a copy of the list of possible relations and builds from the copy another possible tree structure with the different type of relation. The process is repeated until only one word remains in the sentence. The last step is to build the tree structure from the list of final relations. The algorithm for finding the tree structure is implemented in the Java programming language, and the data are stored in the Java Collections Framework types List and Map. If there is some ambiguity in a relation, the algorithm finds all possible tree structures of the sentence.

The example below illustrates the work of the algorithm. The input is the sentence "Lucia číta veľmi peknú knihu." 6 First, the algorithm creates a list of words from the sentence, finds all their possible forms in Tvaroslovník and puts them into a map. In our example, the algorithm generates the following map:

    • Lucia: [idWord: 128848, idForm: 1, form: Lucia, part-of-speech: noun, categories: gender: feminine; number: singular; case: nominative]
    • číta: [idWord: 8679, idForm: 6, form: číta, part-of-speech: verb, categories: person: third; number: singular; time: present]
    • veľmi: [idWord: 102690, idForm: 0, form: veľmi, part-of-speech: adverb, categories: None]
    • peknú: [idWord: 56578, idForm: 17, form: peknú, part-of-speech: adjective, categories: gender: feminine; number: singular; case: accusative]
    • knihu: [idWord: 27834, idForm: 4, form: knihu, part-of-speech: noun, categories: gender: feminine; number: singular; case: accusative]

     6 Lucy is reading a very beautiful book.

After that, the map of possible relations with their priorities is created and sorted by priority:

    • dependent: veľmi and superior: peknú, priority: 9
 Dependent        Superior       Priority           Required grammatical categories                                 Type of relation
    Noun         Conjunction       10                              None                                  Multiple sentence member with noun
  Adjective      Conjunction       10                              None                                 Multiple sentence member with adjective
  Pronoun        Conjunction       10                              None                                 Multiple sentence member with pronoun
  Numeral        Conjunction       10                              None                                 Multiple sentence member with numeral
   Adverb          Adverb           9                              None                                            Complex adverbial
   Adverb         Adjective         9                              None                                            Complex attribute
  Pronoun         Pronoun           9                              None                                               Two pronoun
Pronoun sa, si      Verb            8                              None                                              Reflexive verb
 Preposition      Adjective         7                           Same case                                      Preposition with adjective
 Preposition      Pronoun           7                           Same case                                      Preposition with pronoun
 Preposition        Noun            6                           Same case                                        Preposition with noun
Auxiliary verb      Verb            6                              None                                            Complex predicate
Auxiliary verb      Noun            6                              None                                            Complex predicate
Auxiliary verb    Adjective         6                              None                                            Complex predicate
Auxiliary verb    Pronoun           6                              None                                            Complex predicate
  Pronoun         Adjective         5                 Same gender, number and case                                 Complex attribute
  Pronoun           Noun            4                 Same gender, number and case                                  Pronoun attribute
  Adjective         Noun            4                 Same gender, number and case                                 Adjective attribute
    Noun            Noun            3        Nouns shouldn’t have same gender, number or case                           Attribute
 Preposition        Verb            3                              None                                           Preposition with verb
  Pronoun           Verb            2          Pronoun shouldn’t be in the nominative case                           Pronoun object
    Noun            Verb            2            Noun shouldn’t be in the nominative case                             Noun object
  Adjective         Verb            2          Adjective shouldn’t be in the nominative case                        Adjective object
  Numeral           Verb            2           Numeral shouldn’t be in the nominative case                          Numeral object
   Adverb           Verb            2                              None                                                Adverbial
    Noun            Verb            1                Noun should be in the first case                                 Noun subject
  Adjective         Verb            1               Adjective should be in the first case                           Adjective subject
  Pronoun           Verb            1               Pronoun should be in the first case                             Pronoun subject
  Numeral           Noun            1               Numeral should be in the first case                             Numeral subject

                                      Table 2: Relations and their priorities
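The way the priorities in Table 2 drive the construction can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the authors' code: the names `Word`, `Relation` and `GreedyTreeSketch` are hypothetical, every word carries a single reading instead of all its forms, only six of the 29 rules are encoded, the candidate relations are recomputed from the adjacent pairs in every round rather than updated incrementally, ties go to the leftmost candidate, and alternative trees for ambiguous relations are not explored.

```java
import java.util.ArrayList;
import java.util.List;

/** One reading of a word; gender, number and gcase are null where they do not apply. */
record Word(String form, String pos, String gender, String number, String gcase) {
    boolean agreesWith(Word other) {
        return gender != null && gender.equals(other.gender)
                && number != null && number.equals(other.number)
                && gcase != null && gcase.equals(other.gcase);
    }
}

/** A chosen dependency between two words together with the priority of its rule. */
record Relation(String dependent, String superior, int priority, String type) {}

final class GreedyTreeSketch {

    /** Returns the matching rule from a six-rule subset of Table 2, or null if none applies. */
    static Relation match(Word dep, Word sup) {
        boolean nominative = "nominative".equals(dep.gcase());
        switch (dep.pos() + ">" + sup.pos()) {
            case "adverb>adjective": return rel(dep, sup, 9, "complex attribute");
            case "adjective>noun":   return dep.agreesWith(sup) ? rel(dep, sup, 4, "adjective attribute") : null;
            case "adjective>verb":   return nominative ? rel(dep, sup, 1, "adjective subject")
                                                       : rel(dep, sup, 2, "adjective object");
            case "noun>verb":        return nominative ? rel(dep, sup, 1, "noun subject")
                                                       : rel(dep, sup, 2, "noun object");
            case "adverb>verb":      return rel(dep, sup, 2, "adverbial");
            default:                 return null;
        }
    }

    private static Relation rel(Word dep, Word sup, int priority, String type) {
        return new Relation(dep.form(), sup.form(), priority, type);
    }

    /** Repeatedly attaches the dependent of the highest-priority relation between neighbours. */
    static List<Relation> reduce(List<Word> sentence) {
        List<Word> words = new ArrayList<>(sentence);
        List<Relation> finalRelations = new ArrayList<>();
        while (words.size() > 1) {
            Relation best = null;
            int bestDependentIndex = -1;
            for (int i = 0; i + 1 < words.size(); i++) {
                // Either neighbour may be the dependent, so both directions are tried.
                Relation leftDep = match(words.get(i), words.get(i + 1));
                Relation rightDep = match(words.get(i + 1), words.get(i));
                if (leftDep != null && (best == null || leftDep.priority() > best.priority())) {
                    best = leftDep;
                    bestDependentIndex = i;
                }
                if (rightDep != null && (best == null || rightDep.priority() > best.priority())) {
                    best = rightDep;
                    bestDependentIndex = i + 1;
                }
            }
            if (best == null) {
                break; // no rule applies; stop instead of looping forever
            }
            finalRelations.add(best);
            words.remove(bestDependentIndex); // the dependent leaves the sentence
        }
        return finalRelations;
    }

    /** The sentence "Lucia číta veľmi peknú knihu." with one reading per word. */
    static List<Word> example() {
        return List.of(
                new Word("Lucia", "noun", "feminine", "singular", "nominative"),
                new Word("číta", "verb", null, null, null),
                new Word("veľmi", "adverb", null, null, null),
                new Word("peknú", "adjective", "feminine", "singular", "accusative"),
                new Word("knihu", "noun", "feminine", "singular", "accusative"));
    }

    public static void main(String[] args) {
        for (Relation r : reduce(example())) {
            System.out.println(r.dependent() + " -> " + r.superior()
                    + " (" + r.priority() + ", " + r.type() + ")");
        }
    }
}
```

On the example sentence this sketch selects relations with priorities 9, 4, 2 and 1 in that order and stops when only the predicate "číta" is left, which then serves as the root of the tree.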


    • dependent: peknú and superior: knihu, priority: 4
    • dependent: veľmi and superior: číta, priority: 2
    • dependent: Lucia and superior: číta, priority: 1

First iteration. The relation with priority 9 is chosen. This relation is added to the list of final relations and removed from the list of possible relations. The algorithm goes through the list of possible relations and removes every relation with the dependent word veľmi. The dependent word is removed from the sentence, and the algorithm checks whether a new relation is created. After the removal, there is a new possible relation between the words číta and peknú with priority 2. This new relation is added to the list of possible relations, and the list is sorted again. After the first iteration, the sentence is "Lucia číta peknú knihu." 7 The list of final relations is:

    • dependent: veľmi and superior: peknú, priority: 9

     7 Lucy is reading a beautiful book.

And the list of possible relations is:

    • dependent: peknú and superior: knihu, priority: 4
    • dependent: peknú and superior: číta, priority: 2
    • dependent: Lucia and superior: číta, priority: 1

Second iteration. The relation with priority 4 is selected. This relation is added to the list of final relations and removed from the list of possible relations. The algorithm goes through the list of possible relations and removes every relation with the dependent word peknú. The dependent word is removed from the sentence, and the algorithm checks whether a new relation is created. There is a new possible relation between the words číta and knihu with priority 2. This new relation is added to the list of possible relations, and the list is sorted again. After the second iteration, the sentence is "Lucia číta knihu." 8 The list of final relations is:

    • dependent: veľmi and superior: peknú, priority: 9
    • dependent: peknú and superior: knihu, priority: 4

And the list of possible relations is:

    • dependent: knihu and superior: číta, priority: 2
    • dependent: Lucia and superior: číta, priority: 1

Third iteration. The relation with priority 2 is selected. This relation is added to the list of final relations and removed from the list of possible relations. The algorithm goes through the list of possible relations and removes every relation with the dependent word knihu. The dependent word is removed from the sentence, and the algorithm checks whether a new relation is created. However, no new relation can be created. After the third iteration, the sentence is "Lucia číta." 9 The list of final relations is:

    • dependent: veľmi and superior: peknú, priority: 9
    • dependent: peknú and superior: knihu, priority: 4
    • dependent: knihu and superior: číta, priority: 2

And the list of possible relations is:

    • dependent: Lucia and superior: číta, priority: 1

Fourth iteration. The relation with priority 1 is chosen. This relation is added to the list of final relations and removed from the list of possible relations. The dependent word is removed from the sentence. Only the last word remains in the sentence and the list of possible relations is empty, so the fourth iteration is the last one and all needed relations are in the list of final relations. The list of final relations is:

    • dependent: veľmi and superior: peknú, priority: 9
    • dependent: peknú and superior: knihu, priority: 4
    • dependent: knihu and superior: číta, priority: 2
    • dependent: Lucia and superior: číta, priority: 1

In the last step, the tree structure is built from the list of final relations. Figure 2 illustrates the gradual construction of the tree structure. Panels (a), (b) and (c) show a partial tree of the sentence after the first three iterations, and (d) gives the output for the whole sentence after the fourth iteration.

     8 Lucy is reading a book.
     9 Lucy is reading.
    10 Lenka is buying clothes.


4     Conclusions and further research

In this paper, we have proposed an algorithm for creating a tree structure of a Slovak sentence. We have presented the running of our algorithm on the example Slovak sentence "Lucia číta veľmi peknú knihu". However, the Slovak language allows various word orders and combinations of sentence members. The algorithm also finds the tree structure of the following types of sentences:

    • simple sentence, for example "Lenka nakupuje oblečenie." 10
    • simple sentence without subject, for example "Kráčame do školy." 11
    • simple sentence with multiple sentence members of the same kind, for example "Milý a pekný Martin kupuje kvety Lucke." 12
    • compound sentence with two predicates, for example "Deti sa hrajú na ihrisku a rodičia sa rozprávajú." 13
    • compound sentence with two predicates and multiple sentence members of the same kind, for example "Pekný a upravený dom stojí na kraji ulice a bývajú v ňom dvaja ľudia." 14

   Figure 2: Tree structure after each iteration: (a) first iteration, (b) second iteration, (c) third iteration, (d) fourth iteration.

We aim to improve the presented algorithm in our future research. In particular, our objective is to analyze the following special types of sentences:

    • fundament sentence, for example "Prší." 15
    • simple sentence with a complex predicate, for example "Mal by som už ísť." 16
    • compound sentence with more predicates, for example "Na záhrade máme červené tulipány, ktoré nám darovala stará mama a vedľa tulipánov je záhon ruží." 17

Testing the algorithm on the various types of Slovak sentences and visualizing the outputs in a user-friendly environment are the main objectives of our future research.

    11 We are going to school.
    12 Nice and handsome Martin buys flowers for Lucy.
    13 Children are playing on a playground and parents talk.
    14 A beautiful and tidy house stands on the side of the street and two people are living in it.
    15 It is raining.
    16 I should go now.
    17 We have red tulips in the garden that our grandmother gave us, and next to the tulips is a bed of roses.


References

 [1] D. Hládek, S. Ondáš, and J. Staš, “Online natural language processing of the Slovak language,” Nov. 2014.
 [2] D. Hládek, J. Staš, and J. Juhár, “Morphological analysis of the Slovak language,” Advances in Electrical and Electronic Engineering, vol. 13, no. 4, pp. 289–294, 2015.
 [3] “Slovenské slovníky.” https://slovnik.juls.savba.sk/?d=pskcs&d=psken&d=locutio&d=ma. (Accessed on 06/09/2020).
 [4] R. Garabík, “Slovenský národný korpus,” 2020.
 [5] D. Zeman, “Slovak dependency treebank in universal dependencies,” Journal of Linguistics/Jazykovedný časopis, vol. 68, no. 2, pp. 385–395, 2017.
 [6] N. R. Krajči S., “Tvaroslovník – databáza tvarov slov slovenského jazyka,” in Zborník príspevkov z pracovného seminára ITAT, pp. 57–61, 2012.
 [7] N. R. Krajči S., “Projekt tvaroslovník – slovník všetkých tvarov všetkých slovenských slov,” in Znalosti 2012, pp. 109–112, Vydavatelství MFF UK, 2012.
 [8] J. Hiľovská, Syntaktická analýza slovenskej vety pomocou Tvaroslovníka. PhD thesis, 2017.
 [9] R. McDonald, K. Lerman, and F. Pereira, “Multilingual dependency analysis with a two-stage discriminative parser,” in Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pp. 216–220, 2006.
[10] D. Chen and C. D. Manning, “A fast and accurate dependency parser using neural networks,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 740–750, 2014.
[11] “Natural language processing centre.” https://nlp.fi.muni.cz/en/NLPCentre. (Accessed on 06/09/2020).
[12] “Tools | Úfal.” https://ufal.mff.cuni.cz/tools. (Accessed on 06/09/2020).
[13] M. Straka, “UDPipe 2.0 prototype at CoNLL 2018 UD shared task,” in Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 197–207, 2018.
[14] “Utkl.” http://utkl.ff.cuni.cz/en/utkl.html. (Accessed on 06/09/2020).
[15] J. Pavlovič, Syntax slovenského jazyka I. 2012.