The Hybrid Approach to Part-of-Speech Disambiguation

E. P. Bruches (1), D. A. Karpenko (2), and V. A. Krayvanova (3)

1 OnPositive, Novosibirsk, Russia, bruches@bk.ru
2 OnPositive, Novosibirsk, Russia, dmitry.karpenko@onpositive.com
3 Altai State Technical University, Barnaul, Russia, krayvanova@yandex.ru

Abstract. In this paper a hybrid approach to part-of-speech disambiguation for Russian is presented. It combines a rule-based approach with one based on artificial neural networks. The two methods are applied independently, their results are compared, and a decision about the correct part of speech for the word-form is made. The main advantage of the presented algorithm is its ability to suggest several parts of speech when it is difficult to select the right one with a fair degree of confidence. This reduces the probability of error at the syntactic and semantic levels of analysis. A Java implementation of the proposed algorithm is published under the EPL.

Keywords: NLP, part-of-speech disambiguation, artificial neural networks, morphological analysis

1 Introduction

One of the barriers to using NLP for practical tasks is the lack of a free high-quality parser. The final aim of our work is to create such a parser. This paper describes our solution to one of the problems we faced, namely part-of-speech (POS) ambiguity and the need for good POS-tagging.

Over the last few years, in addition to the traditional approaches to disambiguation (statistical [1] and rule-based [2]), machine-learning-based algorithms have been proposed. Stanford Tagger 2.0 uses a maximum entropy cyclic dependency network [3]. Santos et al. [4] have suggested using an MLP with neural character embeddings. The tendency to apply machine learning algorithms to this task is also observed for the Russian language. Malanin [5] uses heterogeneous neural networks for POS-tagging in Russian. Antonova and Solov'ev [6] present the results of a Conditional Random Fields (CRF) approach.

Approaches based on a formal description of the rules for determining parts of speech require much manual labour. Neural networks and other statistical approaches allow these rules to be extracted implicitly from annotated corpora, but the question of the sufficiency of samples arises. Hence, in this paper we suggest our own approach to POS disambiguation, which combines artificial neural networks with a rule-based approach.

2 Algorithm

The algorithm takes Russian sentences as input. We have developed a tagging module based on a morphological dictionary. The dictionary is formed from the OpenCorpora (http://opencorpora.org/) and Wiktionary (https://ru.wiktionary.org/) data. The tags contained in the dictionary are part of speech, gender, number, case, etc. The full list can be found on our web-site (http://176.9.34.20:8080/com.onpositive.text.webview/materials/tagsets.html).

At the initial stage the tagging module assigns morphological features to all the word-forms. Let Tags_w be the set of tags assigned to a word-form w. The pair T_w = ⟨w, Tags_w⟩ is the token of the word-form w. The sequence of tokens is the input of the POS homonymy disambiguation module. Let SP_0 ⊂ Tags_w be the set of POS tags initially assigned to a word-form. The following situations can occur as a result of the disambiguation algorithm:

- a single POS tag for the word-form is detected;
- the set of POS tags is reduced;
- the set of POS tags assigned initially is kept unchanged.
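To illustrate the data the disambiguation module operates on, the following Java sketch models a token as a word-form together with its tag set and exposes the initial POS set SP_0. The class, enum and method names are our own illustration and are not taken from the published implementation.

import java.util.HashSet;
import java.util.Set;

// A minimal sketch of the token representation T_w = <w, Tags_w> and of the
// initial POS set SP_0. Names are illustrative only.
public final class Token {

    // Simplified tag inventory; the real dictionary also carries gender, number, case, etc.
    public enum Tag { NOUN, VERB, ADJECTIVE, ADVERB, PREPOSITION, NOMINATIVE, SINGULAR, PLURAL }

    private static final Set<Tag> POS_TAGS =
            Set.of(Tag.NOUN, Tag.VERB, Tag.ADJECTIVE, Tag.ADVERB, Tag.PREPOSITION);

    private final String wordForm;   // the word-form w
    private final Set<Tag> tags;     // Tags_w, assigned by the tagging module

    public Token(String wordForm, Set<Tag> tags) {
        this.wordForm = wordForm;
        this.tags = new HashSet<>(tags);
    }

    public String wordForm() {
        return wordForm;
    }

    // SP_0: the part-of-speech tags among Tags_w.
    public Set<Tag> posTags() {
        Set<Tag> pos = new HashSet<>(tags);
        pos.retainAll(POS_TAGS);
        return pos;
    }

    // |SP_0| > 1 means the disambiguation module has to process this token.
    public boolean isPosAmbiguous() {
        return posTags().size() > 1;
    }
}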
We apply the neural network approach and the rule-based approach independently and then compare the results to make the overall disambiguation decision. Let SP_r be the set of parts of speech chosen by the rule-based approach and let SP_n be the set of parts of speech chosen by the neural networks. If SP_r ∩ SP_n is not empty, then the word-form is assigned all the tags from the set SP_new = SP_r ∩ SP_n. Otherwise the word-form is assigned all the tags from the set SP_new = SP_r ∪ SP_n. For some words with POS homonymy one of the parts of speech appears in the vast majority of cases, while the other one is extremely rare. We have defined the POS tag for such words beforehand and apply a filtration algorithm to process them. The list of words to which filtration is applied can be found on the project web-site (http://176.9.34.20:8080/com.onpositive.text.webview/materials/filters.html).

Let us consider each approach by itself.

There are several possible approaches to solving the task of POS disambiguation with neural networks. In the simplest case a single network with outputs corresponding to all the recognized parts of speech can be trained. However, such an architecture makes it theoretically possible to obtain a part of speech which is absent in the dictionary for the given word-form. Another possible approach is to train an individual neural network for every set of POS tags which occurs in the training set. The size of the training subsets for such networks is limited, and thus sufficient recognition quality cannot be guaranteed. Therefore we use a third approach, in which two neural networks are trained for every ambiguous pair of POS tags ⟨sp1, sp2⟩. The network N_{sp1,sp2} estimates the correctness of tag sp1 in the presence of the alternative tag sp2, and the network N_{sp2,sp1} estimates it vice versa. For example, in "двадцать минут ехали" (`they were driving for twenty minutes') the homonymous word "минут" (`minutes'/`will pass') can be a noun or a verb; the right choice is a noun. Hence, the neural network N_{sp1,sp2} is a classifier with a single neuron in the output layer, which estimates the likelihood that the input vector corresponds to the part of speech sp1.

Training sets for the neural networks are made from a part of the annotated text corpus from OpenCorpora. The main corpus of OpenCorpora does not have full POS disambiguation. The cases which have POS ambiguity in the dictionary and which were disambiguated in the corpus are used for training: we have found the words which are POS-ambiguous in the dictionary and trained each network on such cases. The obtained cases are classified by parts of speech and by homonymy type. For uncommon pairs ⟨sp1, sp2⟩ we are not able to collect a sufficient training set, so these homonymy cases are not examined in this approach.
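A minimal Java sketch of this pairwise scheme is given below. The class and method names, the string keys and the registry structure are our own assumptions rather than the published implementation; the construction of the input vector and the acceptance threshold are described in the following paragraphs.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the pairwise scheme: for every ambiguous POS pair <sp1, sp2>
// two directional classifiers are kept, N_{sp1,sp2} and N_{sp2,sp1}.
// Names are illustrative only.
public class PairwisePosClassifiers {

    // A directional network: a classifier with a single output neuron that
    // estimates the likelihood that the input vector corresponds to its first tag.
    public interface DirectionalNetwork {
        double likelihood(double[] inputVector);
    }

    private final Map<String, DirectionalNetwork> networks = new HashMap<>();

    private static String key(String sp1, String sp2) {
        return sp1 + "->" + sp2;
    }

    public void register(String sp1, String sp2, DirectionalNetwork network) {
        networks.put(key(sp1, sp2), network);
    }

    // For an ambiguous pair, query both directions and keep the tags whose
    // score exceeds the threshold (the threshold is discussed in the text below).
    public Set<String> disambiguate(String sp1, String sp2, double[] inputVector, double threshold) {
        Set<String> accepted = new HashSet<>();
        DirectionalNetwork n12 = networks.get(key(sp1, sp2));
        DirectionalNetwork n21 = networks.get(key(sp2, sp1));
        if (n12 != null && n12.likelihood(inputVector) > threshold) {
            accepted.add(sp1);
        }
        if (n21 != null && n21.likelihood(inputVector) > threshold) {
            accepted.add(sp2);
        }
        return accepted;
    }
}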
The neural network analyses not the word-form itself but only its morphological image Tags_n(w) ⊂ Tags(w), which includes the following key features: part of speech, case, gender, number, transitivity and aspect. The analysed context consists of three words: the homonymous word w_i, the word w_{i-1} to the left of it and the word w_{i+1} to the right of it, where i is the word number in the sentence. At the sentence boundary the parameters of the previous or following word are replaced with zeros. The case where a phrase consists of a single word is not analysed.

The network input is a vector of binary values (0 or 1), which is calculated as follows. The tags Tags_n(w) of the word-form w are converted into a binary vector B of length M, where M is the number of tags in the corresponding set. Let x_t be the binary value of tag t in some grammatical category; then x_t = 1 if t ∈ Tags(w) and x_t = 0 otherwise. For example, if x_t corresponds to the Nominative case, then x_t = 1 when the tagging module has assigned the Nominative case to the current word-form. If the word-form has another case, or does not have this category at all, then x_t = 0. If more than one tag corresponds to the current word-form, i.e. in the case of homonymy, all the binary values x_t corresponding to the possible tags are set to 1 in the binary vector B.

Let us consider an example of forming the data for the POS grammatical category. For simplicity we use a reduced list of possible parts of speech: noun, adjective, verb, adverb. In this setting the word "трактор" (`tractor'), which is unambiguously tagged as a noun, is associated with the binary vector ⟨1, 0, 0, 0⟩. The word "белила" (`whitewash'), which can be a noun or a verb depending on the context, is associated with the binary vector ⟨1, 0, 1, 0⟩.

As a result, the group of binary vectors ⟨B_{i-1}, B_i, B_{i+1}⟩ corresponds to the current word and its context. The numbering of the word-forms in the sentence begins with 0. Four bits E = ⟨e1, e2, e3, e4⟩, which encode the word-form position with respect to the sentence boundaries, are added at the end of the group ⟨B_{i-1}, B_i, B_{i+1}⟩: if i = 0 then e1 = 1; if i = 1 then e2 = 1; if i = sn - 2 then e3 = 1; if i = sn - 1 then e4 = 1, where sn is the number of word-forms in the sentence. The obtained vector of binary values is given as input to all the neural networks designed to resolve the given homonymy case. For example, if there is a Noun/Verb ambiguity, the two networks N_{noun,verb} and N_{verb,noun}, designed for the noun and the verb respectively, are used. If the output of a neural network is above the specified threshold value, then the verified POS tag is supposed to be correct. The algorithm described above is applied within the bounds of a sentence to all the words which have homonymy, from left to right.

Based on the training rate and recognition quality criteria, we use multilayer homogeneous neural networks with a sigmoidal activation function and the resilient backpropagation algorithm for training. The input layer has 133 neurons, each of the two hidden layers has 532 neurons, and the output layer has a single neuron.
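The following Java sketch illustrates how such an input vector could be assembled. The tag inventory and its ordering are placeholders of our own; note that 3 * M + 4 = 133 for M = 43, which is consistent with the size of the input layer described above.

import java.util.List;
import java.util.Set;

// Sketch of building the network input vector <B_{i-1}, B_i, B_{i+1}, E>.
// The concrete tag inventory and its ordering are our own illustration.
public class InputVectorBuilder {

    private final List<String> tagOrder;   // fixed ordering of the M tags

    public InputVectorBuilder(List<String> tagOrder) {
        this.tagOrder = tagOrder;
    }

    // Tags_n(w) -> binary vector B of length M; a missing word (sentence boundary) gives zeros.
    private double[] encodeWord(Set<String> tags) {
        double[] b = new double[tagOrder.size()];
        if (tags == null) {
            return b; // beyond the sentence boundary: all zeros
        }
        for (int t = 0; t < tagOrder.size(); t++) {
            if (tags.contains(tagOrder.get(t))) {
                b[t] = 1.0; // homonymous word-forms set several bits within one category
            }
        }
        return b;
    }

    // i is the 0-based position of the homonymous word, sn the sentence length in word-forms.
    public double[] build(Set<String> prevTags, Set<String> curTags, Set<String> nextTags,
                          int i, int sn) {
        int m = tagOrder.size();
        double[] input = new double[3 * m + 4];
        System.arraycopy(encodeWord(prevTags), 0, input, 0, m);
        System.arraycopy(encodeWord(curTags), 0, input, m, m);
        System.arraycopy(encodeWord(nextTags), 0, input, 2 * m, m);
        // Four position bits E = <e1, e2, e3, e4>.
        if (i == 0)      input[3 * m] = 1.0;
        if (i == 1)      input[3 * m + 1] = 1.0;
        if (i == sn - 2) input[3 * m + 2] = 1.0;
        if (i == sn - 1) input[3 * m + 3] = 1.0;
        return input;
    }
}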
Our rule-based approach to POS disambiguation consists of two stages: applying the rules and testing the results with syntactic relations. The algorithm considers the word-form context within the bounds of a sentence and applies the rules to each word-form w_i to which the tagging module assigned more than one part of speech, i.e. |SP_0| > 1. The rules are created manually and represent a pair C(⟨T⟩) → sp, where C(⟨T⟩) is a predicate which takes the context of the analysed word-form w_i as input parameters, ⟨T⟩ is a set of tokens, and sp is the part of speech assigned to the analysed word-form w_i if the predicate is true on this word-form context. Each rule is aimed at the choice between two parts of speech. All the rules are divided into several groups depending on the homonymy type they deal with. They come in two types:

1. rules that require the presence of certain grammatical features in the word-form context (for example, one of the rules disambiguating the word-form "вести" (`news'/`to lead') requires the presence of a preposition in order to define this word-form as a noun);
2. rules that require the absence of certain grammatical features in the word-form context (for example, one of the rules disambiguating the word-form "том" (`volume'/`that') requires the absence of a preposition in order to define this word-form as a noun).

The obtained sets of rules can be found on the web-site (http://176.9.34.20:8080/com.onpositive.text.webview/materials/rules.html).

Before the analysis the tokens T_{w_{i-1}}, T_{w_i}, T_{w_{i+1}} are divided into parts of speech and all possible sets of two and three tokens are formed. Let T^sp_{w_i} denote the POS tags from the token T_{w_i}. According to the number of analysed tokens, all the rules are divided into:

- triplet: ⟨T⟩ = ⟨T^sp_{w_{i-1}}, T^sp_{w_i}, T^sp_{w_{i+1}}⟩;
- paired prefix: ⟨T⟩ = ⟨T^sp_{w_{i-1}}, T^sp_{w_i}⟩;
- paired postfix: ⟨T⟩ = ⟨T^sp_{w_i}, T^sp_{w_{i+1}}⟩;
- paired neutral: ⟨T⟩ = ⟨T^sp_{w_i}, T^sp_{w_j}⟩, where T^sp_{w_j} = T^sp_{w_{i+1}} or T^sp_{w_j} = T^sp_{w_{i-1}}.

For example, if T_{w_{i-1}} has two possible parts of speech and T_{w_i} has three possible parts of speech, then 2 * 3 = 6 possible combinations are produced for the prefix and neutral types of rules. All the possible combinations for ⟨T^sp_{w_{i-1}}, T^sp_{w_i}⟩ are given as input to all the prefix and neutral rules, and all the possible combinations for ⟨T^sp_{w_i}, T^sp_{w_{i+1}}⟩ are given as input to all the postfix and neutral rules. The POS tags from the initial set SP_0 of the word-form w_i which are included in phrases satisfying at least one of the rules are kept. At the beginning and at the end of a sentence, as well as for sentences consisting of two words, only the rules with two arguments are used. We do not apply this approach to sentences consisting of a single word.

After the word-form has been assigned the set of POS tags SP_r, the result is tested using syntactic rules. The testing is as follows. There is a set of syntactic relations which can be formed by words with a certain POS tag. If a word with an assigned POS tag sp ∈ SP_r is able to form a syntactic relation in the sentence, then sp is supposed to be correct. If some of the POS tags are verified by the syntactic rules, the other POS tags are removed. If no POS tag is verified, the algorithm proposes all the POS tags assigned by the rules.

3 Evaluation and Results

The quality of the proposed algorithm was evaluated on two disambiguated text corpora: OpenCorpora (the part which was not used for training) and RusCorpora (http://www.ruscorpora.ru/). In order to evaluate the quality for words with POS homonymy, we calculated the precision of partial disambiguation (at least one of the assigned tags is correct) with formula (1):

P = W_semiAmbig / W_Ambig,   (1)

where W_semiAmbig is the number of words for which SP_new ∩ SP_etalon ≠ ∅, SP_new is the POS-tag set of the current ambiguous word-form produced by the algorithm, SP_etalon is the POS-tag set assigned to the word-form in the corpus, and W_Ambig is the number of words which were initially assigned more than one POS tag. Another characteristic of the quality is the accuracy over all the words (including words without homonymy), computed with formula (2):

Acc = W_correct / W_total,   (2)

where W_correct is the number of words for which SP_new ∩ SP_etalon ≠ ∅ and W_total is the total number of words. We have also calculated how much the size of the set of assigned POS tags decreased after processing. Corpora sizes and evaluation results are shown in Table 1. The last column shows what percentage of extra POS tags was removed.
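As a small worked example of formulas (1) and (2), the following Java sketch computes both metrics over a list of evaluated words. The record and method names are our own and are not part of the published implementation.

import java.util.List;
import java.util.Set;

// Sketch of the evaluation metrics (1) and (2): partial-disambiguation precision
// and overall accuracy. Names are illustrative only.
public class Evaluation {

    // One evaluated word: the POS set produced by the algorithm (SP_new),
    // the gold POS set from the corpus (SP_etalon), and whether the word
    // was initially POS-ambiguous (|SP_0| > 1).
    public record EvaluatedWord(Set<String> spNew, Set<String> spEtalon, boolean initiallyAmbiguous) {
        boolean matchesEtalon() {
            for (String tag : spNew) {
                if (spEtalon.contains(tag)) {
                    return true; // SP_new and SP_etalon intersect
                }
            }
            return false;
        }
    }

    // Formula (1): P = W_semiAmbig / W_Ambig, over initially ambiguous words only.
    public static double precision(List<EvaluatedWord> words) {
        long ambiguous = words.stream().filter(EvaluatedWord::initiallyAmbiguous).count();
        long semiAmbiguous = words.stream()
                .filter(w -> w.initiallyAmbiguous() && w.matchesEtalon())
                .count();
        return ambiguous == 0 ? 0.0 : (double) semiAmbiguous / ambiguous;
    }

    // Formula (2): Acc = W_correct / W_total, over all words.
    public static double accuracy(List<EvaluatedWord> words) {
        long correct = words.stream().filter(EvaluatedWord::matchesEtalon).count();
        return words.isEmpty() ? 0.0 : (double) correct / words.size();
    }
}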
Table 1. Results

Corpus        Total size   Homonymous words   Precision   Accuracy   Extra tags removed
OpenCorpora   33566        5880               96.11%      99.02%     93.04%
RusCorpora    169908       61137              86.39%      93.54%     89.22%

Such a considerable divergence in the results can be explained by the fact that these corpora use different tag sets and, moreover, different approaches to the determination of parts of speech.

For comparison, the trigram model described in paper [1] reaches a precision of 97.18% in the part-of-speech tagging task. The M-WANN-Tagger, presented in paper [7], copes with the same task for Russian with a precision of 97.01%. The authors of [5] achieved a precision of 96.56% in part-of-speech tagging using heterogeneous neural networks.

4 Conclusion

The proposed approach to part-of-speech disambiguation for Russian combines machine learning and expert systems. The results revealed that the algorithm copes well with notional parts of speech and is less effective with functional ones; this is one of the priority directions for our future investigation. We are also going to extend the algorithm to multi-word expressions and unknown words and to take punctuation into account. The Java implementation of the proposed algorithm is available on our web-site (http://176.9.34.20:8080/com.onpositive.text.webview/parsing/omonimy).

References

1. Sokirko, A., Toldova, S.: The comparison of two methods of lexical and morphological disambiguation for Russian. Internet-Mathematics 2005 (2005)
2. Brill, E.: A Simple Rule-Based Part of Speech Tagger. In: Proceedings of ANLC'92, 152-155 (1992)
3. Manning, C.D.: Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In: Computational Linguistics and Intelligent Text Processing, 12th International Conference, Part I. Lecture Notes in Computer Science 6608, 171-189 (2011)
4. Santos, C., Zadrozny, B.: Learning character-level representations for part-of-speech tagging. In: Proceedings of the 31st International Conference on Machine Learning, JMLR: W&CP (2014)
5. Malanin, G.: Part-of-Speech tagging with heterogeneous neural networks and a priori information. Youth Scientific-Technical Journal (12) (2014)
6. Antonova, A., Solov'ev, A.: Conditional Random Fields in NLP-related Tasks for Russian. Information Technologies and Systems (2013)
7. Carneiro, H., França, F., Lima, P.: Multilingual part-of-speech tagging with weightless neural networks. Neural Networks (16), 11-21 (2015)