Syntactic analysis of the Slovak sentence

                                                 Michaela Vočková and Stanislav Krajči

                            Institute of Computer Science, Pavol Jozef Šafárik University in Košice, Slovakia,
                                               michaela.vockova@student.upjs.sk

Abstract: Natural language processing has recently become a widely discussed topic in
computer science. Its main goal is to enable computers to understand human languages. In
this work-in-progress paper, we propose an algorithm for creating a tree structure of a
Slovak sentence. The tree structure of a sentence represents the relationships and
dependencies between the words in the sentence. The root of the tree is the predicate.
Understanding the structure of a sentence is important for other natural language
processing tasks, such as semantic analysis. There are many different types of sentences
in the Slovak language, which we took into account when creating the algorithm, for
example sentences with a multiple sentence member, compound sentences, sentences with a
compound predicate, and others. Our algorithm correctly analysed 85 out of 100 different
sentences.

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).

1    Introduction

Natural language processing is a part of artificial intelligence and linguistics that
focuses on the understanding of human language by computers. There are different tasks in
natural language processing:

    • Automatic summarization provides summaries or detailed information about a text of a
      known type.
    • Co-reference resolution determines, for a sentence or a larger piece of text, which
      words refer to the same object.
    • Discourse analysis identifies the discourse structure of a text.
    • Machine translation is the automatic translation of text from one human language to
      another.
    • Morphological segmentation separates words into individual morphemes and identifies
      the class of the morphemes.
    • Named entity recognition takes a stream of text and determines which items in it
      refer to proper names.
    • Optical character recognition takes an image representing printed text and determines
      the corresponding text.
    • Part-of-speech tagging takes a sentence and determines the part of speech of each
      word.

Some of these tasks can be used as subtasks of more complex assignments [1].

Semantic and syntactic parsing is also a part of natural language processing; it aims to
provide the internal relations between words. There are two approaches to finding the
structure of a sentence: constituent parsing and dependency parsing. Constituent parsing
produces a constituent tree whose nodes are phrases. The goal is to find these phrases and
their relations. Approaches to constituent parsing include chart-based and transition-based
models, and both have statistical and neural variants. Dependency parsing uses a
bilexicalized dependency grammar, which contains all semantic and syntactic dependencies.
Dependency parsing models are divided into two groups, graph-based models and
transition-based models, both of which have their own statistical or neural network
approaches [2].

This work-in-progress paper proposes an improvement of the algorithm for creating a tree
structure of a Slovak sentence [19]. The algorithm is not based on statistical data from a
corpus; instead it takes raw data from Tvaroslovník, a database of all forms of Slovak
words. The tree structure of a sentence represents the relationships and dependencies
between the words in the sentence. The root of the tree is the predicate. The tree
structure for the Slovak sentence Hodina dnes začala malým kvízom.¹ is shown in Figure 1.

    ¹ The class started with a small quiz today.

                              začala

          Hodina        dnes            kvízom

                                        malým

Figure 1: Example of sentence tree structure for Hodina dnes začala malým kvízom.¹

2    State of the Art

The Institute of Formal and Applied Linguistics at Charles University in Prague has created
the Prague Dependency Treebank, which is an excellent contribution to natural language
processing. Several tools have been developed to find the structure of a sentence or to
work on other natural language processing tasks based on this corpus or on the Universal
Dependencies treebanks, for example [3]:

    • Netgraph – a graphically oriented client-server application for searching in an
      annotated corpus.
    • TrEd – an editor used to search for a syntactically annotated sentence structure.
    • Morfo – a system for morphological analysis of the Czech language.


    • MorphoDiTa – a free tool for morphological analysis of natural language texts.
    • Moses – a statistical machine translation system that allows translation models to be
      trained automatically for any language pair.
    • UDPipe – a trainable pipeline for tokenization, tagging, lemmatization, and dependency
      parsing. The institute has developed two versions of UDPipe [4], [5].

The Natural Language Processing Centre at Masaryk University in Brno is mainly engaged in
research on the processing of the Czech, English, and Slovak languages. It deals with
morphological, syntactic, and semantic analysis and with the creation of corpora and
dictionaries. The centre has created several tools for morphological, syntactic, and
semantic analysis. Examples include [6]:

    • Majka – a morphological analyzer for the Slovak, Czech, Polish, Swedish, and German
      languages.
    • Sketch Engine – a tool used to search for information in text corpora.
    • CZ accent – a tool for adding accents to text.
    • Synt and SET – parsers used to determine the structure of a sentence.
    • Visual Browser – Java software that visualizes data in RDF format.

The Institute of Theoretical and Computational Linguistics at Charles University develops
computational tools for automatic language processing, for example for the syntactic
annotation of Czech corpora or a grammar-based treebank of the Czech language [7].

As for the Czech language, there are several tools, dictionaries, and conferences devoted
to natural language processing research for the Slovak language. The Language Institute of
Ľudovít Štúr offers a wide selection of dictionaries. These include [8], [9], [10], and
many more [11]. It also provides the Slovak National Corpus, an electronic database mainly
containing Slovak texts written since 1955 in different styles, genres, thematic areas,
regions, and so on. The institute has also developed tools for searching for words in the
Slovak National Corpus and working with them. For example, DEVELOPER visualizes the
occurrences of one or two words in the corpus, DIAKRITIK corrects diacritics, and KOLOKAT
visualizes distances between two terms in the corpus [12]. Every two years, the institute
organizes SLOVKO, a conference on natural language processing [13]. In 2017, D. Zeman
presented the article Slovak Dependency Treebanks in Universal Dependencies, about
converting the syntactically annotated part of the Slovak National Corpus into the
annotation scheme known as Universal Dependencies. Universal Dependencies is an
international standard and also the largest collection of freely available dependency
treebanks [14]. Tvaroslovník, a database of Slovak words and their forms, was created at
Pavol Jozef Šafárik University in Košice [15], [16]. The master's thesis [17] deals with
the creation of an algorithm for finding the structure of a sentence.

3     Dictionaries

More information about words is needed to create a sentence structure. We therefore use two
dictionaries in our algorithm for syntactic analysis: Tvaroslovník and a valency dictionary.

3.1   Tvaroslovník

Tvaroslovník is a database of all forms of all Slovak words from [8] and [9]. Every row
contains the form of a word, its part of speech, and its grammatical categories. The data
in Tvaroslovník were collected from the dictionary of the Slovak language. The database
contains approximately 220,000 words and 24,000,000 records of words and all their forms.
All data are stored in one table with the following columns:

    • idWord – unique identification number of a word,
    • idForm – unique identification number of a word's form,
    • form – a form of the word,
    • part-of-speech,
    • categories – grammatical categories, which differ for every part of speech.
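
A lookup in such a table can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the class FormRecord, the helper noun and the
function analyses are hypothetical names, the real database schema and access layer are not
described in this paper, and the sample rows are a handful of the records for hodina from
Table 1.

```python
from dataclasses import dataclass


@dataclass
class FormRecord:
    """One row of the table: the five columns listed above (names are assumed)."""
    id_word: int
    id_form: int
    form: str
    part_of_speech: str
    categories: dict


def noun(id_form, form, number, case):
    # Helper for the sample data only: all rows belong to the word hodina (idWord 20009).
    return FormRecord(20009, id_form, form, "noun",
                      {"gender": "feminine", "number": number, "case": case})


# A few of the records from Table 1, not the full paradigm.
RECORDS = [
    noun(0, "hodina", "singular", "nominative"),
    noun(1, "hodiny", "singular", "genitive"),
    noun(3, "hodinu", "singular", "accusative"),
    noun(7, "hodiny", "plural", "nominative"),
    noun(10, "hodiny", "plural", "accusative"),
    noun(11, "hodiny", "plural", "vocative"),
]


def analyses(form, records=RECORDS):
    """Return every record whose form matches, ignoring letter case."""
    return [r for r in records if r.form == form.lower()]


for record in analyses("Hodiny"):
    print(record.id_form, record.categories["number"], record.categories["case"])
```

The form hodiny matches several records with different cases and numbers, which is the
ambiguity that the special case Same form of word in Section 4.1 has to handle.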
Table 1 shows an example of records for the word hodina.²

    ² hour

   idWord   idForm   form       part-of-speech   categories
   20009    0        hodina     noun             gender: feminine; number: singular; case: nominative
   20009    1        hodiny     noun             gender: feminine; number: singular; case: genitive
   20009    2        hodine     noun             gender: feminine; number: singular; case: dative
   20009    3        hodinu     noun             gender: feminine; number: singular; case: accusative
   20009    4        hodina     noun             gender: feminine; number: singular; case: vocative
   20009    5        hodine     noun             gender: feminine; number: singular; case: locative
   20009    6        hodinou    noun             gender: feminine; number: singular; case: instrumental
   20009    7        hodiny     noun             gender: feminine; number: plural; case: nominative
   20009    8        hodín      noun             gender: feminine; number: plural; case: genitive
   20009    9        hodinám    noun             gender: feminine; number: plural; case: dative
   20009    10       hodiny     noun             gender: feminine; number: plural; case: accusative
   20009    11       hodiny     noun             gender: feminine; number: plural; case: vocative
   20009    12       hodinách   noun             gender: feminine; number: plural; case: locative
   20009    13       hodinami   noun             gender: feminine; number: plural; case: instrumental

Table 1: Tvaroslovník

3.2     Valency dictionary

The valency dictionary contains the two most common types of covalence between words. The
first is the covalence between a verb and a preposition, or between a verb and the most
common case of the following term. The second is the covalence between a noun and a
preposition. To build the valency dictionary, we took nouns and verbs from Tvaroslovník,
and the covalences with prepositions and cases were created automatically from the
examples in Krátky slovník slovenského jazyka [18]. The dictionary contains the following
columns:

    • idWord – unique identification number of the word from Tvaroslovník,
    • preposition – the preposition that follows the noun or verb,
    • case – the case of the word that follows the noun or verb.

Table 2 illustrates examples from the valency dictionary.

   idWord      preposition   case
   6016        null          accusative
   6016        proti         dative
   31494       v             locative
   31494       null          accusative
   31494       null          instrumental
   62420       null          accusative

Table 2: Examples of covalences for nouns and verbs

4      Tree structure of sentence

We presented the main idea of the algorithm for finding the tree structure in [19]. For the
present version of the algorithm, we expanded the table of relations and added the cases of
Slovak sentences described in the subsection Special cases of sentences. Table 3 shows the
new table of relations, and Algorithm 1 gives the pseudocode for the main idea of the
tree-finding algorithm.

    input: sentence
    output: tree structure of sentence
    find all forms for the words in the sentence in Tvaroslovník;
    create list of possible relations;
    while list of possible relations is not empty and sentence has more than one word do
        choose relation with greatest priority;
        add chosen relation to list of final relations;
        remove chosen relation from list of possible relations;
        foreach relation in list of possible relations do
            if relation has same dependent and different superior word as chosen relation then
                remove relation from list of possible relations;
            end
        end
        remove dependent word of chosen relation from sentence;
        if new possible relation is created then
            add new relation to list of possible relations;
        end
    end
    build tree structure from list of final relations;

Algorithm 1: Pseudocode of the algorithm for finding the tree structure

4.1     Special cases of sentences

Slovak is a flexible language with many peculiarities that we took into account when
creating the method.

    • Multiple sentence member: While searching for the initial possible relations, we find
      out whether there is a conjunction or a comma in the sentence. If so, we check
      whether the words before and after the conjunction are the same part of speech and
      have the same grammatical categories. If this condition is fulfilled, we add
      relations between the conjunction and these words to the list of possible relations.
      The conjunction then takes over the grammatical categories of the words it connects.
      For example, in the sentence Noviny a časopisy píšu o celebritách.³ the words noviny
      and časopisy are the same sentence member; therefore the list of possible relations
      contains the relation between noviny and a and the relation between časopisy and a,
      both with priority 12. The word a then participates as a noun in the nominative case.

    • Multiple verbs in sentence: The occurrence of several verbs in a sentence is another
      special case. Before we start looking for possible relations in a sentence, we
      determine whether this is the case. After finding the verbs, we check whether there
      is a conjunction or a comma between them. If there is, we classify the sentence as a
      compound sentence. We then divide the sentence at the conjunction or comma into
      subsentences, which we process as separate sentences. In the resulting output, we
      connect these subsentences by relations between the conjunction or comma and the
      roots of the subsentences. Figure 2 shows an example of the sentence structure for
      the sentence Mama číta noviny a otec píše správu.⁴ In a sentence containing several
      verbs without a conjunction or comma between them, we assume that there is a compound
      verb relation. We therefore connect the found verbs by a relation and add it to the
      list of possible relations. Figure 3 shows an example of such a sentence structure
      for the sentence Ráno začalo pršať.⁵

    • Same form of word: Some words have the same form in several cases, so it is sometimes
      difficult to determine which relation they can form. We find all possible relations
      for the word. In the step where we iterate over the list of possible relations and
      remove relations with the same dependent word as the currently selected relation, we
      may find a relation with the same dependent and superior word but a different
      priority. In that case we create another pair of lists of final and possible
      relations, in which the relation with the different priority is used. The method then
      outputs two trees. Figure 4 illustrates the two possible outputs for the sentence
      Dievča upieklo mame perníkové srdce.⁶

    • Different part of speech for the same form: Besides having the same form in multiple
      cases, a word may also have the same form for multiple parts of speech. For example,
      the word to is both a pronoun and a particle. We created a list that contains the
      most commonly used part of speech for such words. If we set the method to find only
      the most relevant sentence structures, we use only the most commonly used part of
      speech for a form.

    ³ Newspapers and magazines write about celebrities.
    ⁴ Mother is reading newspapers and father is writing a message.
    ⁵ It started to rain in the morning.
    ⁶ The girl baked a gingerbread heart for mum.
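
The treatment of sentences with several verbs can be illustrated by a minimal Python sketch.
It assumes that the sentence is already tagged with parts of speech (here by hand, not by
Tvaroslovník), and the names split_compound and CONNECTORS are hypothetical. It splits the
sentence at the first conjunction or comma found between two neighbouring verbs; it does not
check whether that conjunction instead joins a multiple sentence member.

```python
CONNECTORS = {"conjunction", "comma"}


def split_compound(tokens):
    """Split a tagged sentence at connectors that lie between neighbouring verbs.

    tokens is a list of (form, part_of_speech) pairs. The function returns the
    list of subsentences and the list of connector forms that joined them.
    With no connector between the verbs, the sentence comes back as one part.
    """
    verbs = [i for i, (_, pos) in enumerate(tokens) if pos == "verb"]
    parts, connectors, start = [], [], 0
    for left, right in zip(verbs, verbs[1:]):
        # Look only at the words strictly between two neighbouring verbs.
        for i in range(left + 1, right):
            if tokens[i][1] in CONNECTORS:
                parts.append(tokens[start:i])
                connectors.append(tokens[i][0])
                start = i + 1
                break
    parts.append(tokens[start:])
    return parts, connectors


compound = [("Mama", "noun"), ("číta", "verb"), ("noviny", "noun"),
            ("a", "conjunction"),
            ("otec", "noun"), ("píše", "verb"), ("správu", "noun")]
compound_verb = [("Ráno", "adverb"), ("začalo", "verb"), ("pršať", "verb")]

for sentence in (compound, compound_verb):
    parts, connectors = split_compound(sentence)
    print([[form for form, _ in part] for part in parts], connectors)
```

For the first sentence the sketch yields two subsentences joined by a; for the second it
yields a single part and no connector, which corresponds to the compound verb case.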
    Dependent                           Superior         Priority   Required grammatical categories
    verb                                auxiliary verb   13         none
    noun, adjective, pronoun, numeral   auxiliary verb   13         none
    verb                                conjunction      12         none
    noun                                conjunction      12         none
    adjective                           conjunction      12         none
    pronoun                             conjunction      12         none
    numeral                             conjunction      12         none
    adverb                              conjunction      12         none
    adverb                              adverb           11         none
    adverb                              adjective        11         none
    pronoun sa, si                      verb             10         none
    pronoun                             adjective        9          none
    adjective                           noun             8          same gender, case and number
    numeral                             noun             8          same gender, case and number
    pronoun                             noun             8          same gender, case and number
    noun                                noun             7          case of dependent noun is accusative
    noun                                noun             6          case of dependent noun is genitive
    adjective                           preposition      5          same case
    pronoun                             preposition      5          same case
    noun                                preposition      4          same case
    preposition                         noun             4          noun and preposition are together in valency dictionary
    pronoun                             verb             3          case of pronoun is not in valency dictionary; pronoun is not in the nominative case
    noun                                verb             3          case of noun is not in valency dictionary; noun is not in the nominative case
    adjective                           verb             3          case of adjective is not in valency dictionary; adjective is not in the nominative case
    numeral                             verb             3          case of numeral is not in valency dictionary; numeral is not in the nominative case
    pronoun                             verb             2          case of pronoun is in valency dictionary; pronoun is not in the nominative case
    noun                                verb             2          case of noun is in valency dictionary; noun is not in the nominative case
    adjective                           verb             2          case of adjective is in valency dictionary; adjective is not in the nominative case
    numeral                             verb             2          case of numeral is in valency dictionary; numeral is not in the nominative case
    adverb                              verb             2          none
    noun                                verb             1          noun is in the nominative case
    adjective                           verb             1          adjective is in the nominative case
    pronoun                             verb             1          pronoun is in the nominative case
    numeral                             verb             1          numeral is in the nominative case

                                           Table 3: Relations and their priorities
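
How the priorities drive the greedy selection can be illustrated with a minimal Python
sketch. It is not the authors' implementation: it uses only four rows of Table 3, ignores
the valency dictionary (so priorities 2 and 3 are not distinguished), takes hand-written
grammatical categories instead of Tvaroslovník records, and assumes that relations are
formed only between neighbouring words, recomputing the candidates after each removal. The
names Word, RULES, priority and build_tree are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    form: str
    pos: str
    case: str = ""
    gender: str = ""
    number: str = ""


def agree(dep, sup):
    """Same gender, case and number (the condition of the priority 8 rows)."""
    return (dep.gender, dep.case, dep.number) == (sup.gender, sup.case, sup.number)


# (dependent POS, superior POS, priority, condition): a small subset of Table 3.
RULES = [
    ("adjective", "noun", 8, agree),
    ("adverb", "verb", 2, lambda dep, sup: True),
    ("noun", "verb", 2, lambda dep, sup: dep.case != "nominative"),
    ("noun", "verb", 1, lambda dep, sup: dep.case == "nominative"),
]


def priority(dep, sup):
    """Highest priority of a rule that allows dep to depend on sup, or None."""
    found = [p for d, s, p, cond in RULES
             if d == dep.pos and s == sup.pos and cond(dep, sup)]
    return max(found, default=None)


def build_tree(sentence):
    """Greedy selection: attach the best candidate, remove its dependent, repeat."""
    words = list(sentence)
    final = []  # chosen (dependent, superior) pairs
    while len(words) > 1:
        candidates = []
        for i in range(len(words) - 1):
            # Assumption: only neighbouring words may form a relation.
            for d, s in ((i, i + 1), (i + 1, i)):
                p = priority(words[d], words[s])
                if p is not None:
                    candidates.append((p, d, s))
        if not candidates:
            break
        _, d, s = max(candidates, key=lambda c: c[0])  # first candidate wins ties
        final.append((words[d].form, words[s].form))
        del words[d]  # removing the dependent may create new neighbours
    return final, [w.form for w in words]


sentence = [
    Word("Hodina", "noun", "nominative", "feminine", "singular"),
    Word("dnes", "adverb"),
    Word("začala", "verb"),
    Word("malým", "adjective", "instrumental", "masculine", "singular"),
    Word("kvízom", "noun", "instrumental", "masculine", "singular"),
]
relations, remaining = build_tree(sentence)
print(relations, remaining)
```

On this input the sketch selects malým under kvízom first (priority 8), then attaches dnes,
kvízom and Hodina to začala, which remains as the root, in agreement with Figure 1.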


                                   a

                    číta                     píše

              Mama        noviny       otec        správu

Figure 2: Example of sentence tree structure for Mama číta noviny a otec píše správu.⁴

                         začalo

                  Ráno            pršať

Figure 3: Example of sentence tree structure for Ráno začalo pršať.⁵

                         upieklo

              Dievča      mame      srdce

                                    perníkové

                            A

                         upieklo

              srdce       mame      dievča

              Perníkové

                            B

Figure 4: Example of two possible outputs for sentence Dievča upieklo mame perníkové srdce.⁶

5   Conclusion and future research

To evaluate the algorithm for creating a tree structure, we built a dataset of 100 different
Slovak sentences. The sentences were taken from fairy tales and articles on the Internet.
The dataset contains:

    • simple sentences: Martin zavrtel hlavou.⁷
    • simple sentences with different sentence members: Chlapec vykročil z tieňa tmavých
      jedlí na čistinku uprostred lesa.⁸
    • compound sentences: Teší sa z jeho krásy a užíva si pokojný relax.⁹
    • sentences with a multiple sentence member: Uprostred hlučného a ubehaného mestečka
      leží krásny zelený park.¹⁰
    • sentences with a compound predicate: V mestskej časti si môžu návštevníci užiť
      kúpalisko.¹¹

We created this dataset manually and added the required tree structure to each sentence.
For 85 sentences, the algorithm produced a tree structure identical to the required one.
The main causes of incorrect structures were:

    • A number written in digits in a sentence, for example Hrad vznikol pravdepodobne v
      druhej polovici 13. storočia.¹²
    • A changed position of words in a nominal predicate, for example Vhodná je paralela z
      čias môjho starého otca.¹³

    ⁷ Martin shook his head.
    ⁸ The boy walked out of the shadows of dark firs to a clearing in the middle of the
      forest.
    ⁹ She enjoys its beauty and enjoys peaceful relaxation.
    ¹⁰ In the middle of a noisy and deserted town lies a beautiful green park.
    ¹¹ Visitors can enjoy the swimming pool in the city.
    ¹² The castle was probably built in the second half of the 13th century.
    ¹³ A parallel from my grandfather's time is appropriate.

In our future work we want to focus on:

    • eliminating the above problems,
    • testing the method on other sentences,
    • creating a web interface for this algorithm.
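
The comparison of produced and required tree structures used in this section can be sketched
as follows. This is an assumed formulation, not the authors' evaluation script: each tree is
represented as a collection of (dependent, superior) pairs, a sentence counts as correct
only if the two sets of pairs are equal, and the function name exact_match_count and the toy
data are hypothetical.

```python
def exact_match_count(produced, required):
    """Count sentences whose produced edge set equals the required edge set."""
    return sum(set(p) == set(r) for p, r in zip(produced, required))


# Toy data: two sentences, the second one with a wrongly attached word.
required = [
    [("Ráno", "začalo"), ("pršať", "začalo")],
    [("Martin", "zavrtel"), ("hlavou", "zavrtel")],
]
produced = [
    [("pršať", "začalo"), ("Ráno", "začalo")],      # same edges, different order
    [("Martin", "hlavou"), ("hlavou", "zavrtel")],  # one edge differs
]

correct = exact_match_count(produced, required)
print(f"{correct} of {len(required)} tree structures are identical")
```

Comparing sets rather than lists makes the result independent of the order in which the
relations were selected.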


References
 [1] Khurana, D., Koli, A., Khatter, K., Singh, S.: Natural language processing: State of
     the art, current trends and challenges. 2017. arXiv preprint arXiv:1708.05148.
 [2] Zhang, M.: A survey of syntactic-semantic parsing based on constituent and dependency
     structures. Science China Technological Sciences (2020): 1–23.
 [3] https://ufal.mff.cuni.cz/pdt2.0/doc/pdt-guide/cz/html/index.html. (Accessed on
     06/10/2021)
 [4] Straka, M., Straková, J., Hajič, J.: Prague at EPE 2017: The UDPipe system. 2017. In
     Proceedings of the 2017 Shared Task on Extrinsic Parser Evaluation at the Fourth
     International Conference on Dependency Linguistics and the 15th International
     Conference on Parsing Technologies. Pisa, Italy (pp. 65–74).
 [5] Straka, M.: UDPipe 2.0 prototype at CoNLL 2018 UD shared task. 2018. In Proceedings of
     the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal
     Dependencies (pp. 197–207).
 [6] https://nlp.fi.muni.cz/en/NLPCentre. (Accessed on 06/10/2021)
 [7] http://utkl.ff.cuni.cz/en/utkl.html. (Accessed on 06/10/2021)
 [8] Peciar, Š.: (Ed.) Slovník slovenského jazyka (Vol. 4). Vydavateľstvo SAV. 1964.
 [9] Kraus, J.: Slovník cudzích slov: akademický. Slovenské pedagogické nakladateľstvo. 2005.
[10] Považaj, M. a kol.: Pravidlá slovenského pravopisu. 4. nezmenené vyd. Bratislava. Veda
     2013. 592 s. ISBN 978-80-224-1331-2
[11] https://slovnik.juls.savba.sk/. (Accessed on 06/10/2021)
[12] Garabík, R.: Slovenský národný korpus. 2020. Accessed on https://korpus.sk/.
[13] https://korpus.sk/slovko.html. (Accessed on 06/10/2021)
[14] Zeman, D.: Slovak dependency treebank in universal dependencies. 2017. Journal of
     Linguistics/Jazykovedný časopis, 68(2), 385–395.
[15] Krajči, S., Novotný, R.: Tvaroslovník – databáza tvarov slov slovenského jazyka. In
     zborník príspevkov z pracovného seminára ITAT. 2012. (pp. 57–61).
[16] Krajči, S., Novotný, R.: Projekt Tvaroslovník – slovník všetkých tvarov všetkých
     slovenských slov. Znalosti 2012. 2012. pp. 109–112. Vydavatelství MFF UK.
[17] Hiľovská, J.: Syntaktická analýza slovenskej vety pomocou Tvaroslovníka. UPJŠ. 2017.
[18] Kačala, J.: (Ed.) Krátky slovník slovenského jazyka. Veda. 1987.
[19] Linková, M., Krajči, S.: Tree structure of Slovak sentences. 2020. In Proceedings of
     the 20th Conference Information Technologies – Applications and Theory. (pp. 67–74).