=Paper=
{{Paper
|id=Vol-2962/paper06
|storemode=property
|title=Syntactic Analysis of the Slovak Sentence
|pdfUrl=https://ceur-ws.org/Vol-2962/paper06.pdf
|volume=Vol-2962
|authors=Michaela Vočková,Stanislav Krajči
|dblpUrl=https://dblp.org/rec/conf/itat/VockovaK21
}}
==Syntactic Analysis of the Slovak Sentence ==
Michaela Vočková and Stanislav Krajči
Institute of Computer Science, Pavol Jozef Šafárik University in Košice, Slovakia,
michaela.vockova@student.upjs.sk
Abstract: Natural language processing has recently become a widely discussed topic in computer science. Its main idea is the understanding of human languages by computers. In this work-in-progress paper, we propose an algorithm for creating a tree structure of a Slovak sentence. The tree structure of a sentence represents the relationships and dependencies between the words in the sentence. The root of the tree is the predicate. Understanding the structure of a sentence is important for other natural language processing tasks, such as semantic analysis. The Slovak language has many different types of sentences, which we took into account when creating the algorithm, for example sentences with a multiple sentence member, compound sentences, sentences with a compound predicate, and others. Our algorithm correctly analysed 85 out of 100 different sentences.
Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 Introduction
Natural language processing is a part of artificial intelligence and linguistics that focuses on the understanding of human language by computers. It covers a variety of tasks:
• Automatic summarization provides summaries or detailed information of a text of a known type.
• Co-reference resolution determines, for a sentence or a larger piece of text, which words refer to the same object.
• Discourse analysis identifies the discourse structure of a text.
• Machine translation is the automatic translation of text from one human language to another.
• Morphological segmentation separates words into individual morphemes and identifies the class of each morpheme.
• Named entity recognition takes a stream of text and determines which items in it relate to proper names.
• Optical character recognition takes an image representing printed text and determines the corresponding text.
• Part-of-speech tagging takes a sentence and determines the part of speech of each word.
Some of these tasks can be used as subtasks of more complex ones [1].
Semantic and syntactic parsing is also a part of natural language processing; it aims to reveal the internal relations between words. There are two approaches to finding the structure of a sentence: constituent parsing and dependency parsing. Constituent parsing produces a constituent tree whose nodes are phrases; the goal is to find these phrases and their relations. Approaches to constituent parsing include chart-based and transition-based models, both of which have statistical and neural variants. Dependency parsing uses a bilexicalized dependency grammar, which contains all semantic and syntactic dependencies. Dependency parsing models are divided into two groups, graph-based models and transition-based models, both of which have their own statistical or neural network approaches [2].
This work-in-progress paper proposes an improvement of the algorithm for creating a tree structure of a Slovak sentence [19]. The algorithm is not based on statistical data from a corpus but takes raw data from Tvaroslovník, a database of all forms of Slovak words. The tree structure of a sentence represents the relationships and dependencies between the words in the sentence. The root of the tree is the predicate. The tree structure for the Slovak sentence Hodina dnes začala malým kvízom.1 is shown in Figure 1.
1 The class started with a small quiz today.
Figure 1: Example of sentence tree structure for Hodina dnes začala malým kvízom.1 (The root začala has the children Hodina, dnes and kvízom; malým depends on kvízom.)
2 State of the Art
The Institute of Formal and Applied Linguistics at Charles University in Prague has created the Prague Dependency Treebank, which is an excellent contribution to natural language processing. Several tools based on this corpus or on the Universal Dependencies treebanks have been developed for determining sentence structure or for other natural language processing tasks. For example [3]:
• Netgraph – a graphically oriented client-server application for searching in an annotated corpus.
• TrEd – an editor used to search syntactically annotated sentence structures.
• Morfo – a system for morphological analysis of the Czech language.
• MorphoDiTa – a free tool for morphological analysis of natural language texts.
• Moses – a statistical machine translation system that allows translation models to be trained automatically for any language pair.
• UDPipe – a trainable pipeline for tokenization, tagging, lemmatization, and dependency parsing. The institute has developed two versions of UDPipe [4], [5].
The Natural Language Processing Centre at Masaryk University in Brno is mainly engaged in research into the processing of the Czech, English, and Slovak languages. It deals with morphological, syntactic, and semantic analysis and with the creation of corpora and dictionaries. The centre has created several tools for morphological, syntactic, and semantic analysis. Examples include [6]:
• Majka – a morphological analyzer for the Slovak, Czech, Polish, Swedish, and German languages.
• The Sketch Engine – a tool used to search for information in text corpora.
• CZ accent – a tool for adding accents to text.
• Synt and SET – parsers used to determine the sentence structure.
• Visual Browser – Java software that visualizes data in RDF format.
The Institute of Theoretical and Computational Linguistics at Charles University develops computational tools for automatic language processing, for example, syntactic annotation of Czech corpora or a grammar-based treebank of the Czech language [7].
As for Czech, there are several tools, dictionaries, and conferences devoted to natural language processing research for the Slovak language. The Language Institute of Ľudovít Štúr offers a wide selection of dictionaries. These include [8], [9], [10], and many more [11]. It also provides the Slovak National Corpus, an electronic database containing mainly Slovak texts from 1955 onward, covering different styles, genres, thematic areas, regions, and so on. The Language Institute of Ľudovít Štúr has also developed tools for searching for words in the Slovak National Corpus and working with them. For example, DEVELOPER visualizes the occurrence of one or two words in the corpus, DIAKRITIK corrects diacritics, and KOLOKAT visualizes distances between two terms in the corpus [12]. Every two years, the institute organizes SLOVKO, a conference on natural language processing [13]. In 2017, D. Zeman presented the article Slovak Dependency Treebank in Universal Dependencies about converting the syntactically annotated part of the Slovak National Corpus into the annotation scheme known as Universal Dependencies. Universal Dependencies is an international standard and also the largest collection of freely available dependency treebanks [14]. Tvaroslovník, a database of Slovak words and their forms, was created at Pavol Jozef Šafárik University in Košice [15], [16]. The master's thesis [17] deals with the creation of an algorithm for finding the structure of a sentence.
3 Dictionaries
Creating a sentence structure requires additional information about the words. Therefore, our algorithm of syntactic analysis uses two dictionaries: Tvaroslovník and the valency dictionary.
3.1 Tvaroslovník
Tvaroslovník is a database of all forms of all Slovak words from [8] and [9]. Every row contains a form of a word, its part of speech, and its grammatical categories. The data were collected from the dictionary of the Slovak language. The database contains approximately 220,000 words and 24,000,000 records of words and all their forms. All data are stored in one table with the following columns:
• idWord – unique identification number of the word,
• idForm – unique identification number of the word’s form,
• form – a form of the word,
• part-of-speech,
• categories – grammatical categories, which differ for every part of speech.
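To make the table layout concrete, the following is a minimal sketch of how such records could be held in memory and looked up by surface form. The class `FormRecord`, the helpers `build_index` and `lookup`, and the flattening of the categories column into `number` and `case` fields are illustrative assumptions, not the actual schema or API of Tvaroslovník. The sample rows are a subset of the records listed for hodina in Table 1.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FormRecord:
    """One row of a Tvaroslovník-like table (hypothetical in-memory layout)."""
    id_word: int
    id_form: int
    form: str
    part_of_speech: str
    number: str
    case: str


# Subset of the Table 1 rows for "hodina"; gender is feminine in all of them.
RECORDS = [
    FormRecord(20009, 0, "hodina", "noun", "singular", "nominative"),
    FormRecord(20009, 1, "hodiny", "noun", "singular", "genitive"),
    FormRecord(20009, 4, "hodina", "noun", "singular", "vocative"),
    FormRecord(20009, 7, "hodiny", "noun", "plural", "nominative"),
    FormRecord(20009, 10, "hodiny", "noun", "plural", "accusative"),
    FormRecord(20009, 11, "hodiny", "noun", "plural", "vocative"),
]


def build_index(records: List[FormRecord]) -> Dict[str, List[FormRecord]]:
    """Group records by lower-cased surface form for fast lookup."""
    index: Dict[str, List[FormRecord]] = defaultdict(list)
    for record in records:
        index[record.form.lower()].append(record)
    return index


def lookup(index: Dict[str, List[FormRecord]], form: str) -> List[FormRecord]:
    """Return every record whose form matches, ignoring letter case."""
    return index.get(form.lower(), [])


INDEX = build_index(RECORDS)
# One surface form can carry several readings, which the parser must resolve.
readings = [(r.number, r.case) for r in lookup(INDEX, "Hodiny")]
```

In this sample, the single form hodiny already has four readings, which is why a parser that relies on such a table has to deal with ambiguous forms.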
Table 1 shows an example of records for the word hodina2.
2 hour
idWord | idForm | form | part-of-speech | categories
20009 | 0 | hodina | noun | gender: feminine; number: singular; case: nominative
20009 | 1 | hodiny | noun | gender: feminine; number: singular; case: genitive
20009 | 2 | hodine | noun | gender: feminine; number: singular; case: dative
20009 | 3 | hodinu | noun | gender: feminine; number: singular; case: accusative
20009 | 4 | hodina | noun | gender: feminine; number: singular; case: vocative
20009 | 5 | hodine | noun | gender: feminine; number: singular; case: locative
20009 | 6 | hodinou | noun | gender: feminine; number: singular; case: instrumental
20009 | 7 | hodiny | noun | gender: feminine; number: plural; case: nominative
20009 | 8 | hodín | noun | gender: feminine; number: plural; case: genitive
20009 | 9 | hodinám | noun | gender: feminine; number: plural; case: dative
20009 | 10 | hodiny | noun | gender: feminine; number: plural; case: accusative
20009 | 11 | hodiny | noun | gender: feminine; number: plural; case: vocative
20009 | 12 | hodinách | noun | gender: feminine; number: plural; case: locative
20009 | 13 | hodinami | noun | gender: feminine; number: plural; case: instrumental
Table 1: Tvaroslovník
3.2 Valency dictionary
The valency dictionary contains the two most common types of valency relations between words. The first is the relation between a verb and a preposition, or between a verb and the most common case of the following word. The second is the relation between a noun and a preposition. To build the valency dictionary, we took nouns and verbs from Tvaroslovník; their valencies with prepositions and cases were created automatically from examples in Krátky slovník slovenského jazyka [18]. The dictionary contains the following columns:
• idWord – unique identification number of the word from Tvaroslovník,
• preposition – the preposition that follows the noun or verb,
• case – the case of the word that follows the noun or verb.
Table 2 illustrates example records from the valency dictionary.
idWord | preposition | case
6016 | null | accusative
6016 | proti | dative
31494 | v | locative
31494 | null | accusative
31494 | null | instrumental
62420 | null | accusative
Table 2: Examples of valencies for nouns and verbs
4 Tree structure of sentence
We presented the main idea of the algorithm for finding the tree structure in [19]. For the present version of the algorithm, we expanded the table of relations and added the handling of special cases of Slovak sentences, which we describe in the subsection Special cases of sentences. Table 3 shows the new table of relations, and Algorithm 1 gives the pseudocode of the main idea of the tree-finding algorithm.
input: sentence
output: tree structure of sentence
find all forms for words in sentence from Tvaroslovník;
create list of possible relations;
while list of possible relations is not empty and sentence has more than one word do
    choose relation with greatest priority;
    add chosen relation to list of final relations;
    remove chosen relation from list of possible relations;
    foreach relation in list of possible relations do
        if relation has same dependent and different superior word as chosen relation then
            remove relation from list of possible relations;
        end
    end
    remove dependent word of chosen relation from sentence;
    if new possible relation is created then
        add new relation to list of possible relations;
    end
end
build tree structure from list of final relations;
Algorithm 1: Pseudocode of the algorithm for finding the tree structure
4.1 Special cases of sentences
Slovak is a flexible language with many peculiarities, which we took into account when creating the method.
• Multiple sentence member: While searching for the initial possible relations, we check whether the sentence contains a conjunction or a comma. If so, we check whether the words before and after the conjunction are the same part of speech and have the same grammatical categories. If this condition is fulfilled, we add relations between the conjunction and these words to the possible relations. The conjunction then takes over the grammatical categories of the words it connects. For example, in the sentence Noviny a časopisy píšu o celebritách.3 the words noviny and časopisy are the same sentence member; therefore, the list of possible relations contains the relation between noviny and a with priority 12 and the relation between časopisy and a with priority 12. The word a then acts as a noun in the nominative case.
3 Newspapers and magazines write about celebrities.
• Multiple verbs in sentence: The occurrence of several verbs in a sentence is another special case. Before we start looking for possible relations in a sentence, we determine whether it contains more than one verb. If it does, we check whether there is a conjunction or a comma between the verbs. If so, the sentence is classified as a compound sentence. We therefore divide the sentence at the conjunction or comma into subsentences, which we process as separate sentences. In the resulting output, we connect them by relations between the conjunction or comma and the roots of the subsentences. Figure 2 shows the structure of the sentence Mama číta noviny a otec píše správu.4 In a sentence that contains several verbs with no conjunction or comma between them, we assume a compound verb relation; we therefore connect the verbs by a relation and add it to the list of possible relations. Figure 3 shows the structure of such a sentence, Ráno začalo pršať.5
4 Mother is reading newspapers and father is writing a message.
5 It started to rain in the morning.
• Same form of word: Some words have the same form in several cases, so it is sometimes difficult to determine which relation they can form. We find all possible relations for the word. In the step where we iterate over the list of possible relations and remove the relations with the same dependent word as the currently selected relation, we may find a relation with the same dependent and superior word but a different priority. In that case, we create another pair of lists of final and possible relations, in which the relation with the different priority is used. The method then outputs two trees. Figure 4 illustrates the two possible outputs for the sentence Dievča upieklo mame perníkové srdce.6
6 The girl baked a gingerbread heart for mum.
• Different part-of-speech for same form: Besides having the same form in multiple cases, a word may also have the same form for multiple parts of speech. For example, the word to is both a pronoun and a particle. We created a list that contains the most commonly used part of speech for such words. If the method is set to find only the most relevant sentence structures, we use only the most frequently used part of speech for a form.
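The handling of several verbs described in this subsection can be illustrated with a short sketch. It assumes that the words are already tagged with a part of speech. The tag names (including the "comma" tag for punctuation), the function `analyse_verbs`, and its return format are hypothetical choices for this illustration and are not taken from the authors' implementation.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (word form, part-of-speech tag)
CONNECTORS = ("conjunction", "comma")


def analyse_verbs(tagged: List[Tagged]):
    """Split a sentence at connectors that separate two verbs.

    Returns (clauses, connectors, compound_verbs):
      clauses        - subsentences to be parsed separately,
      connectors     - the words the sentence was split at,
      compound_verbs - pairs of verbs with no connector between them.
    """
    verb_positions = [i for i, (_, pos) in enumerate(tagged) if pos == "verb"]
    clauses, connectors, compound_verbs = [], [], []
    start = 0
    for left, right in zip(verb_positions, verb_positions[1:]):
        between = [i for i in range(left + 1, right)
                   if tagged[i][1] in CONNECTORS]
        if between:
            cut = between[0]  # split at the first connector found
            clauses.append(tagged[start:cut])
            connectors.append(tagged[cut][0])
            start = cut + 1
        else:
            # No connector: treat the two verbs as a compound verb relation.
            compound_verbs.append((tagged[left][0], tagged[right][0]))
    clauses.append(tagged[start:])
    return clauses, connectors, compound_verbs


compound_sentence = [
    ("Mama", "noun"), ("číta", "verb"), ("noviny", "noun"),
    ("a", "conjunction"),
    ("otec", "noun"), ("píše", "verb"), ("správu", "noun"),
]
compound_predicate = [("Ráno", "adverb"), ("začalo", "verb"), ("pršať", "verb")]

split_result = analyse_verbs(compound_sentence)
predicate_result = analyse_verbs(compound_predicate)
```

For the first sentence, the sketch yields two subsentences joined by the conjunction a; for the second, it reports the verb pair začalo and pršať as a compound verb and keeps the sentence whole.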
Dependent | Superior | Priority | Required grammatical categories
verb | auxiliary verb | 13 | none
noun, adjective, pronoun, numeral | auxiliary verb | 13 | none
verb | conjunction | 12 | none
noun | conjunction | 12 | none
adjective | conjunction | 12 | none
pronoun | conjunction | 12 | none
numeral | conjunction | 12 | none
adverb | conjunction | 12 | none
adverb | adverb | 11 | none
adverb | adjective | 11 | none
pronoun sa, si | verb | 10 | none
pronoun | adjective | 9 | none
adjective | noun | 8 | same gender, case and number
numeral | noun | 8 | same gender, case and number
pronoun | noun | 8 | same gender, case and number
noun | noun | 7 | case of dependent noun is accusative
noun | noun | 6 | case of dependent noun is genitive
adjective | preposition | 5 | same case
pronoun | preposition | 5 | same case
noun | preposition | 4 | same case
preposition | noun | 4 | noun and preposition are together in valency dictionary
pronoun | verb | 3 | case of pronoun is not in valency dictionary and pronoun shouldn’t be in the nominative case
noun | verb | 3 | case of noun is not in valency dictionary and noun shouldn’t be in the nominative case
adjective | verb | 3 | case of adjective is not in valency dictionary and adjective shouldn’t be in the nominative case
numeral | verb | 3 | case of numeral is not in valency dictionary and numeral shouldn’t be in the nominative case
pronoun | verb | 2 | case of pronoun is in valency dictionary and pronoun shouldn’t be in the nominative case
noun | verb | 2 | case of noun is in valency dictionary and noun shouldn’t be in the nominative case
adjective | verb | 2 | case of adjective is in valency dictionary and adjective shouldn’t be in the nominative case
numeral | verb | 2 | case of numeral is in valency dictionary and numeral shouldn’t be in the nominative case
adverb | verb | 2 | none
noun | verb | 1 | noun should be in the first case
adjective | verb | 1 | adjective should be in the first case
pronoun | verb | 1 | pronoun should be in the first case
numeral | verb | 1 | numeral should be in the first case
Table 3: Relations and their priorities
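The interplay of the relation table and the greedy selection loop of Algorithm 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation. Only four rows of Table 3 are encoded, the grammatical categories are supplied by hand rather than read from Tvaroslovník, and the valency lookup is reduced to a hypothetical set `valency_cases`. After a relation is chosen, all other relations with the same dependent are dropped, so the two-tree case and the creation of new relations are omitted. The names `Token`, `Relation`, `possible_relations` and `select_relations` are invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Token:
    form: str
    pos: str
    gender: Optional[str] = None
    number: Optional[str] = None
    case: Optional[str] = None


@dataclass(frozen=True)
class Relation:
    dependent: int  # index of the dependent word in the sentence
    superior: int   # index of the superior word in the sentence
    priority: int


def agree(a: Token, b: Token) -> bool:
    """Same gender, number and case (condition of the priority-8 rows)."""
    return (a.gender, a.number, a.case) == (b.gender, b.number, b.case)


def possible_relations(tokens: List[Token], valency_cases: Set[str]) -> List[Relation]:
    """Generate candidates using a small subset of the Table 3 rows."""
    relations = []
    for d, dep in enumerate(tokens):
        for s, sup in enumerate(tokens):
            if d == s:
                continue
            if dep.pos == "adjective" and sup.pos == "noun" and agree(dep, sup):
                relations.append(Relation(d, s, 8))
            elif dep.pos == "adverb" and sup.pos == "verb":
                relations.append(Relation(d, s, 2))
            elif dep.pos == "noun" and sup.pos == "verb":
                if dep.case == "nominative":
                    relations.append(Relation(d, s, 1))
                elif dep.case in valency_cases:
                    relations.append(Relation(d, s, 2))
                else:
                    relations.append(Relation(d, s, 3))
    return relations


def select_relations(possible: List[Relation]) -> List[Relation]:
    """Greedy loop: repeatedly take the relation with the greatest priority."""
    remaining = list(possible)
    final = []
    while remaining:
        chosen = max(remaining, key=lambda r: r.priority)
        final.append(chosen)
        # Simplification: drop every other relation of the same dependent.
        remaining = [r for r in remaining if r.dependent != chosen.dependent]
    return final


# Hand-annotated tokens for the example sentence of Figure 1.
SENTENCE = [
    Token("Hodina", "noun", "feminine", "singular", "nominative"),
    Token("dnes", "adverb"),
    Token("začala", "verb"),
    Token("malým", "adjective", "masculine", "singular", "instrumental"),
    Token("kvízom", "noun", "masculine", "singular", "instrumental"),
]

FINAL = select_relations(possible_relations(SENTENCE, valency_cases=set()))
EDGES = {(SENTENCE[r.dependent].form, SENTENCE[r.superior].form) for r in FINAL}
```

For this sentence, the selected edges attach malým to kvízom and the remaining three words to začala, which agrees with the tree in Figure 1. The word that never appears as a dependent, začala, is the root.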
Figure 2: Example of sentence tree structure for Mama číta noviny a otec píše správu.4 (The root a has the children číta and píše; Mama and noviny depend on číta, otec and správu depend on píše.)
Figure 3: Example of sentence tree structure for Ráno začalo pršať.5 (The root začalo has the children Ráno and pršať.)
Figure 4: Example of two possible outputs for sentence Dievča upieklo mame perníkové srdce.6 (Tree A: root upieklo; second level Dievča, mame, srdce; third level perníkové. Tree B: root upieklo; second level srdce, mame, dievča; third level Perníkové.)
5 Conclusion and future research
To evaluate the algorithm for creating a tree structure, we built a dataset of 100 different Slovak sentences. The sentences were taken from fairy tales and articles on the Internet. The dataset contains:
• simple sentences: Martin zavrtel hlavou.7,
• simple sentences with different sentence members: Chlapec vykročil z tieňa tmavých jedlí na čistinku uprostred lesa.8,
• compound sentences: Teší sa z jeho krásy a užíva si pokojný relax.9,
• sentences with multiple sentence member: Uprostred hlučného a ubehaného mestečka leží krásny zelený park.10,
• sentences with compound predicate: V mestskej časti si môžu návštevníci užiť kúpalisko.11.
7 Martin shook his head.
8 The boy walked out of the shadows of dark firs to a clearing in the middle of the forest.
9 She enjoys its beauty and enjoys peaceful relaxation.
10 In the middle of a noisy and bustling town lies a beautiful green park.
11 Visitors can enjoy the swimming pool in the city.
We created this dataset manually and added the expected tree structure to each sentence. For 85 sentences, the tree structure produced by the algorithm was identical to the expected one. The main causes of incorrect structures were:
• A number written in digits in a sentence, for example Hrad vznikol pravdepodobne v druhej polovici 13. storočia.12
• A changed word order in a nominal predicate, for example Vhodná je paralela z čias môjho starého otca.13
12 The castle was probably built in the second half of the 13th century.
13 A parallel from my grandfather’s time is appropriate.
In our future work we want to focus on:
• eliminating the above problems,
• testing the method on other sentences,
• creating a web interface for this algorithm.
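The exact-match criterion used in the evaluation above (a produced tree counts as correct only if it is identical to the expected one) can be expressed compactly. The sketch below is an assumption about how such a comparison could be done. It represents a tree as a set of (dependent, superior) pairs of word forms, which is a simplification that would not distinguish repeated words. The two toy trees are illustrative and are not taken from the dataset's annotations.

```python
from typing import FrozenSet, List, Tuple

Tree = FrozenSet[Tuple[str, str]]  # set of (dependent, superior) edges


def exact_match_count(predicted: List[Tree], expected: List[Tree]) -> int:
    """Count sentences whose predicted tree equals the expected tree."""
    if len(predicted) != len(expected):
        raise ValueError("both lists must cover the same sentences")
    return sum(1 for p, e in zip(predicted, expected) if p == e)


# Toy data: the first tree matches, the second has one wrong attachment.
expected_trees = [
    frozenset({("Ráno", "začalo"), ("pršať", "začalo")}),
    frozenset({("Martin", "zavrtel"), ("hlavou", "zavrtel")}),
]
predicted_trees = [
    frozenset({("Ráno", "začalo"), ("pršať", "začalo")}),
    frozenset({("Martin", "zavrtel"), ("hlavou", "Martin")}),
]

correct = exact_match_count(predicted_trees, expected_trees)
accuracy = correct / len(expected_trees)
```

With this measure, a single wrongly attached word makes the whole sentence count as incorrect, so it is a strict criterion.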
References
[1] Khurana, D., Koli, A., Khatter, K., Singh, S.: Natural language processing: State of the art, current trends and challenges. 2017. arXiv preprint arXiv:1708.05148.
[2] Zhang, M.: A survey of syntactic-semantic parsing based on constituent and dependency structures. Science China Technological Sciences (2020): 1–23.
[3] https://ufal.mff.cuni.cz/pdt2.0/doc/pdt-guide/cz/html/index.html. (Accessed on 06/10/2021)
[4] Straka, M., Straková, J., Hajič, J.: Prague at EPE 2017: The UDPipe system. 2017. In Proceedings of the 2017 Shared Task on Extrinsic Parser Evaluation at the Fourth International Conference on Dependency Linguistics and the 15th International Conference on Parsing Technologies. Pisa, Italy (pp. 65–74).
[5] Straka, M.: UDPipe 2.0 prototype at CoNLL 2018 UD shared task. 2018. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (pp. 197–207).
[6] https://nlp.fi.muni.cz/en/NLPCentre. (Accessed on 06/10/2021)
[7] http://utkl.ff.cuni.cz/en/utkl.html. (Accessed on 06/10/2021)
[8] Peciar, Š.: (Ed.) Slovník slovenského jazyka (Vol. 4). Vydavateľstvo SAV. 1964.
[9] Kraus, J.: Slovník cudzích slov: akademický. Slovenské pedagogické nakladateľstvo. 2005.
[10] Považaj, M. a kol.: Pravidlá slovenského pravopisu. 4. nezmenené vyd. Bratislava. Veda 2013. 592 s. ISBN 978-80-224-1331-2
[11] https://slovnik.juls.savba.sk/. (Accessed on 06/10/2021)
[12] Garabík, R.: Slovenský národný korpus. 2020. Accessed on https://korpus.sk/.
[13] https://korpus.sk/slovko.html. (Accessed on 06/10/2021)
[14] Zeman, D.: Slovak dependency treebank in universal dependencies. 2017. Journal of Linguistics/Jazykovedný časopis, 68(2), 385–395.
[15] Krajči, S., Novotný, R.: Tvaroslovník – databáza tvarov slov slovenského jazyka. In zborník príspevkov z pracovného seminára ITAT. 2012. (pp. 57–61).
[16] Krajči, S., Novotný, R.: Projekt Tvaroslovník – slovník všetkých tvarov všetkých slovenských slov. Znalosti 2012. 2012. pp. 109–112. Vydavatelství MFF UK.
[17] Hiľovská, J.: Syntaktická analýza slovenskej vety pomocou Tvaroslovníka. UPJŠ. 2017.
[18] Kačala, J.: (Ed.) Krátky slovník slovenského jazyka. Veda. 1987.
[19] Linková, M., Krajči, S.: Tree structure of Slovak sentences. 2020. In Proceedings of the 20th Conference Information Technologies – Applications and Theory. (pp. 67–74).