The Hybrid Approach to Part-of-Speech Disambiguation

E. P. Bruches (1), D. A. Karpenko (2), and V. A. Krayvanova (3)

1 OnPositive, Novosibirsk, Russia, bruches@bk.ru
2 OnPositive, Novosibirsk, Russia, dmitry.karpenko@onpositive.com
3 Altai State Technical University, Barnaul, Russia, krayvanova@yandex.ru

Abstract. In this paper a hybrid approach to part-of-speech disambiguation for Russian is presented. It combines a rule-based approach with one based on artificial neural networks. The two methods are applied independently, their results are compared, and a decision about the correct part of speech for the word-form is made. The main advantage of the presented algorithm is its ability to suggest several parts of speech when it is difficult to select the right one with a fair degree of confidence. This reduces the probability of error at the syntactic and semantic levels of analysis. A Java implementation of the proposed algorithm is published under the EPL.

Keywords: NLP, part-of-speech disambiguation, artificial neural networks, morphological analysis

1 Introduction

One of the barriers to using NLP for practical tasks is the lack of a free high-quality parser. The final aim of our work is to create such a parser. This paper describes our solution to one of the problems we faced, namely part-of-speech (POS) ambiguity and the need for good POS-tagging.

Over the last few years, in addition to the traditional approaches to disambiguation (statistical [1] and rule-based [2]), machine-learning-based algorithms have been proposed. Stanford Tagger 2.0 uses a maximum entropy cyclic dependency network [3]. Santos et al. [4] have suggested using an MLP with neural character embeddings. The tendency to apply machine learning algorithms to this task is also observed for the Russian language. Malanin [5] uses heterogeneous neural networks for POS-tagging in Russian. Antonova and Solov'ev [6] present the results of a Conditional Random Fields (CRF) approach.

Approaches based on a formal description of the rules for determining parts of speech require much manual labour. Neural networks and other statistical approaches allow these rules to be extracted implicitly from annotated corpora, but the question of the sufficiency of samples arises. Hence, in this paper we suggest our own approach to POS disambiguation, which combines artificial neural networks with a rule-based approach.

2 Algorithm

The algorithm takes Russian sentences as input. We have developed a tagging module based on a morphological dictionary. The dictionary is formed from the OpenCorpora (http://opencorpora.org/) and Wiktionary (https://ru.wiktionary.org/) data. The tags contained in the dictionary are part of speech, gender, number, case, etc. The full list can be found on our web-site (http://176.9.34.20:8080/com.onpositive.text.webview/materials/tagsets.html).

At the initial stage the tagging module assigns morphological features to all the word-forms. Let Tags_w be the set of tags assigned to a word-form w. The pair T_w = ⟨w, Tags_w⟩ is the token of the word-form w. The sequence of tokens is the input of the POS homonymy disambiguation module. Let SP_0 ⊂ Tags_w be the set of POS tags initially assigned to a word-form. The following situations can occur as a result of the disambiguation algorithm:

- a single POS tag for the word-form is detected;
- the set of POS tags is reduced;
- the set of POS tags assigned initially is kept unchanged.
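To illustrate the data the disambiguation module operates on, the following Java sketch models a token as a word-form together with its tag set and exposes the initial POS set SP_0. The class, enum and method names are our own illustration and are not taken from the published implementation.

import java.util.HashSet;
import java.util.Set;

// A minimal sketch of the token representation T_w = <w, Tags_w> and of the
// initial POS set SP_0. Names are illustrative only.
public final class Token {

    // Simplified tag inventory; the real dictionary also carries gender, number, case, etc.
    public enum Tag { NOUN, VERB, ADJECTIVE, ADVERB, PREPOSITION, NOMINATIVE, SINGULAR, PLURAL }

    private static final Set<Tag> POS_TAGS =
            Set.of(Tag.NOUN, Tag.VERB, Tag.ADJECTIVE, Tag.ADVERB, Tag.PREPOSITION);

    private final String wordForm;   // the word-form w
    private final Set<Tag> tags;     // Tags_w, assigned by the tagging module

    public Token(String wordForm, Set<Tag> tags) {
        this.wordForm = wordForm;
        this.tags = new HashSet<>(tags);
    }

    public String wordForm() {
        return wordForm;
    }

    // SP_0: the part-of-speech tags among Tags_w.
    public Set<Tag> posTags() {
        Set<Tag> pos = new HashSet<>(tags);
        pos.retainAll(POS_TAGS);
        return pos;
    }

    // |SP_0| > 1 means the disambiguation module has to process this token.
    public boolean isPosAmbiguous() {
        return posTags().size() > 1;
    }
}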
We apply the neural network approach and the rule-based approach independently and then compare the results to make the overall disambiguation decision. Let SP_r be the set of parts of speech chosen by the rule-based approach and let SP_n be the set of parts of speech chosen by the neural networks. If SP_r ∩ SP_n is not empty, then the word-form is assigned all the tags from the set SP_new = SP_r ∩ SP_n. Otherwise the word-form is assigned all the tags from the set SP_new = SP_r ∪ SP_n. For some words with POS homonymy one of the parts of speech appears in the vast majority of cases, while the other one is extremely rare. We have defined the POS tag for such words beforehand and apply a filtration algorithm to process them. The list of words to which filtration is applied can be found on the project web-site (http://176.9.34.20:8080/com.onpositive.text.webview/materials/filters.html).

Let us consider each approach by itself.

There are several possible approaches to solving the task of POS disambiguation with neural networks. In the simplest case a single network with outputs corresponding to all the recognized parts of speech can be trained. However, such an architecture makes it theoretically possible to obtain a part of speech which is absent in the dictionary for the given word-form. Another possible approach is to train an individual neural network for every set of POS tags which occurs in the training set. The size of the training subsets for such networks is limited, and thus sufficient recognition quality cannot be guaranteed. Therefore we use a third approach, in which two neural networks are trained for every ambiguous pair of POS tags ⟨sp1, sp2⟩. The network N_{sp1,sp2} estimates the correctness of tag sp1 in the presence of the alternative tag sp2, and the network N_{sp2,sp1} estimates it vice versa. For example, in "двадцать минут ехали" (`they were driving for twenty minutes') the homonymous word "минут" (`minutes'/`will pass') can be a noun or a verb; the right choice is a noun. Hence, the neural network N_{sp1,sp2} is a classifier with a single neuron in the output layer, which estimates the likelihood that the input vector corresponds to the part of speech sp1.

Training sets for the neural networks are made from a part of the annotated text corpus from OpenCorpora. The main corpus of OpenCorpora does not have full POS disambiguation. The cases which have POS ambiguity in the dictionary and which were disambiguated in the corpus are used for training: we have found the words which are POS-ambiguous in the dictionary and trained each network on such cases. The obtained cases are classified by parts of speech and by homonymy type. For uncommon pairs ⟨sp1, sp2⟩ we are not able to collect a sufficient training set, so these homonymy cases are not examined in this approach.
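A minimal Java sketch of this pairwise scheme is given below. The class and method names, the string keys and the registry structure are our own assumptions rather than the published implementation; the construction of the input vector and the acceptance threshold are described in the following paragraphs.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the pairwise scheme: for every ambiguous POS pair <sp1, sp2>
// two directional classifiers are kept, N_{sp1,sp2} and N_{sp2,sp1}.
// Names are illustrative only.
public class PairwisePosClassifiers {

    // A directional network: a classifier with a single output neuron that
    // estimates the likelihood that the input vector corresponds to its first tag.
    public interface DirectionalNetwork {
        double likelihood(double[] inputVector);
    }

    private final Map<String, DirectionalNetwork> networks = new HashMap<>();

    private static String key(String sp1, String sp2) {
        return sp1 + "->" + sp2;
    }

    public void register(String sp1, String sp2, DirectionalNetwork network) {
        networks.put(key(sp1, sp2), network);
    }

    // For an ambiguous pair, query both directions and keep the tags whose
    // score exceeds the threshold (the threshold is discussed in the text below).
    public Set<String> disambiguate(String sp1, String sp2, double[] inputVector, double threshold) {
        Set<String> accepted = new HashSet<>();
        DirectionalNetwork n12 = networks.get(key(sp1, sp2));
        DirectionalNetwork n21 = networks.get(key(sp2, sp1));
        if (n12 != null && n12.likelihood(inputVector) > threshold) {
            accepted.add(sp1);
        }
        if (n21 != null && n21.likelihood(inputVector) > threshold) {
            accepted.add(sp2);
        }
        return accepted;
    }
}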
The neural network analyses not the word-form itself but only its morphological image Tags_n(w) ⊂ Tags(w), which includes the following key features: part of speech, case, gender, number, transitivity and aspect. The analysed context consists of three words: the homonymous word w_i, the word w_{i-1} to the left of it and the word w_{i+1} to the right of it, where i is the word number in the sentence. At the sentence boundary the parameters of the previous or following word are replaced with zeros. The case where a phrase consists of a single word is not analysed.

The network input is a vector of binary values (0 or 1), which is calculated as follows. The tags Tags_n(w) of the word-form w are converted into a binary vector B of length M, where M is the number of tags in the corresponding set. Let x_t be the binary value of tag t in some grammatical category; then x_t = 1 if t ∈ Tags(w) and x_t = 0 otherwise. For example, if x_t corresponds to the Nominative case, then x_t = 1 when the tagging module has assigned the Nominative case to the current word-form. If the word-form has another case, or does not have this category at all, then x_t = 0. If more than one tag corresponds to the current word-form, i.e. in the case of homonymy, all the binary values x_t corresponding to the possible tags are set to 1 in the binary vector B.

Let us consider an example of forming the data for the POS grammatical category. For simplicity we use a reduced list of possible parts of speech: noun, adjective, verb, adverb. In this setting the word "трактор" (`tractor'), which is unambiguously tagged as a noun, is associated with the binary vector ⟨1, 0, 0, 0⟩. The word "белила" (`whitewash'), which can be a noun or a verb depending on the context, is associated with the binary vector ⟨1, 0, 1, 0⟩.

As a result, the group of binary vectors ⟨B_{i-1}, B_i, B_{i+1}⟩ corresponds to the current word and its context. The numbering of the word-forms in the sentence begins with 0. Four bits E = ⟨e1, e2, e3, e4⟩, which encode the word-form position with respect to the sentence boundaries, are added at the end of the group ⟨B_{i-1}, B_i, B_{i+1}⟩: if i = 0 then e1 = 1; if i = 1 then e2 = 1; if i = sn - 2 then e3 = 1; if i = sn - 1 then e4 = 1, where sn is the number of word-forms in the sentence. The obtained vector of binary values is given as input to all the neural networks designed to resolve the given homonymy case. For example, if there is a Noun/Verb ambiguity, the two networks N_{noun,verb} and N_{verb,noun}, designed for the noun and the verb respectively, are used. If the output of a neural network is above the specified threshold value, then the verified POS tag is supposed to be correct. The algorithm described above is applied within the bounds of a sentence to all the words which have homonymy, from left to right.

Based on the training rate and recognition quality criteria, we use multilayer homogeneous neural networks with a sigmoidal activation function and the resilient backpropagation algorithm for training. The input layer has 133 neurons, each of the two hidden layers has 532 neurons, and the output layer has a single neuron.
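The following Java sketch illustrates how such an input vector could be assembled. The tag inventory and its ordering are placeholders of our own; note that 3 * M + 4 = 133 for M = 43, which is consistent with the size of the input layer described above.

import java.util.List;
import java.util.Set;

// Sketch of building the network input vector <B_{i-1}, B_i, B_{i+1}, E>.
// The concrete tag inventory and its ordering are our own illustration.
public class InputVectorBuilder {

    private final List<String> tagOrder;   // fixed ordering of the M tags

    public InputVectorBuilder(List<String> tagOrder) {
        this.tagOrder = tagOrder;
    }

    // Tags_n(w) -> binary vector B of length M; a missing word (sentence boundary) gives zeros.
    private double[] encodeWord(Set<String> tags) {
        double[] b = new double[tagOrder.size()];
        if (tags == null) {
            return b; // beyond the sentence boundary: all zeros
        }
        for (int t = 0; t < tagOrder.size(); t++) {
            if (tags.contains(tagOrder.get(t))) {
                b[t] = 1.0; // homonymous word-forms set several bits within one category
            }
        }
        return b;
    }

    // i is the 0-based position of the homonymous word, sn the sentence length in word-forms.
    public double[] build(Set<String> prevTags, Set<String> curTags, Set<String> nextTags,
                          int i, int sn) {
        int m = tagOrder.size();
        double[] input = new double[3 * m + 4];
        System.arraycopy(encodeWord(prevTags), 0, input, 0, m);
        System.arraycopy(encodeWord(curTags), 0, input, m, m);
        System.arraycopy(encodeWord(nextTags), 0, input, 2 * m, m);
        // Four position bits E = <e1, e2, e3, e4>.
        if (i == 0)      input[3 * m] = 1.0;
        if (i == 1)      input[3 * m + 1] = 1.0;
        if (i == sn - 2) input[3 * m + 2] = 1.0;
        if (i == sn - 1) input[3 * m + 3] = 1.0;
        return input;
    }
}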
Our rule-based approach to POS disambiguation consists of two stages: applying the rules and testing the results with syntactic relations. The algorithm considers the word-form context within the bounds of a sentence and applies the rules to each word-form w_i to which the tagging module assigned more than one part of speech, i.e. |SP_0| > 1. The rules are created manually and represent a pair C(⟨T⟩) → sp, where C(⟨T⟩) is a predicate which takes the context of the analysed word-form w_i as input parameters, ⟨T⟩ is a set of tokens, and sp is the part of speech assigned to the analysed word-form w_i if the predicate is true on this word-form context. Each rule is aimed at the choice between two parts of speech. All the rules are divided into several groups depending on the homonymy type they deal with. They come in two types:

1. rules that require the presence of certain grammatical features in the word-form context (for example, one of the rules disambiguating the word-form "вести" (`news'/`to lead') requires the presence of a preposition in order to define this word-form as a noun);
2. rules that require the absence of certain grammatical features in the word-form context (for example, one of the rules disambiguating the word-form "том" (`volume'/`that') requires the absence of a preposition in order to define this word-form as a noun).

The obtained sets of rules can be found on the web-site (http://176.9.34.20:8080/com.onpositive.text.webview/materials/rules.html).

Before the analysis the tokens T_{w_{i-1}}, T_{w_i}, T_{w_{i+1}} are divided into parts of speech and all possible sets of two and three tokens are formed. Let T^sp_{w_i} denote the POS tags from the token T_{w_i}. According to the number of analysed tokens, all the rules are divided into:

- triplet: ⟨T⟩ = ⟨T^sp_{w_{i-1}}, T^sp_{w_i}, T^sp_{w_{i+1}}⟩;
- paired prefix: ⟨T⟩ = ⟨T^sp_{w_{i-1}}, T^sp_{w_i}⟩;
- paired postfix: ⟨T⟩ = ⟨T^sp_{w_i}, T^sp_{w_{i+1}}⟩;
- paired neutral: ⟨T⟩ = ⟨T^sp_{w_i}, T^sp_{w_j}⟩, where T^sp_{w_j} = T^sp_{w_{i+1}} or T^sp_{w_j} = T^sp_{w_{i-1}}.

For example, if T_{w_{i-1}} has two possible parts of speech and T_{w_i} has three possible parts of speech, then 2 * 3 = 6 possible combinations are produced for the prefix and neutral types of rules. All the possible combinations for ⟨T^sp_{w_{i-1}}, T^sp_{w_i}⟩ are given as input to all the prefix and neutral rules, and all the possible combinations for ⟨T^sp_{w_i}, T^sp_{w_{i+1}}⟩ are given as input to all the postfix and neutral rules. The POS tags from the initial set SP_0 of the word-form w_i which are included in phrases satisfying at least one of the rules are kept. At the beginning and at the end of a sentence, as well as for sentences consisting of two words, only the rules with two arguments are used. We do not apply this approach to sentences consisting of a single word.

After the word-form has been assigned the set of POS tags SP_r, the result is tested using syntactic rules. The testing is as follows. There is a set of syntactic relations which can be formed by words with a certain POS tag. If a word with an assigned POS tag sp ∈ SP_r is able to form a syntactic relation in the sentence, then sp is supposed to be correct. If some of the POS tags are verified by the syntactic rules, the other POS tags are removed. If no POS tag is verified, the algorithm proposes all the POS tags assigned by the rules.

3 Evaluation and Results

The quality of the proposed algorithm was evaluated on two disambiguated text corpora: OpenCorpora (the part which was not used for training) and RusCorpora (http://www.ruscorpora.ru/). In order to evaluate the quality for words with POS homonymy, we calculated the precision of partial disambiguation (at least one of the assigned tags is correct) with formula (1):

P = W_semiAmbig / W_Ambig,   (1)

where W_semiAmbig is the number of words for which SP_new ∩ SP_etalon ≠ ∅, SP_new is the POS-tag set of the current ambiguous word-form produced by the algorithm, SP_etalon is the POS-tag set assigned to the word-form in the corpus, and W_Ambig is the number of words which were initially assigned more than one POS tag. Another characteristic of the quality is the accuracy over all the words (including words without homonymy), computed with formula (2):

Acc = W_correct / W_total,   (2)

where W_correct is the number of words for which SP_new ∩ SP_etalon ≠ ∅ and W_total is the total number of words. We have also calculated how much the size of the set of assigned POS tags decreased after processing. Corpora sizes and evaluation results are shown in Table 1. The last column shows what percentage of extra POS tags was removed.
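As a small worked example of formulas (1) and (2), the following Java sketch computes both metrics over a list of evaluated words. The record and method names are our own and are not part of the published implementation.

import java.util.List;
import java.util.Set;

// Sketch of the evaluation metrics (1) and (2): partial-disambiguation precision
// and overall accuracy. Names are illustrative only.
public class Evaluation {

    // One evaluated word: the POS set produced by the algorithm (SP_new),
    // the gold POS set from the corpus (SP_etalon), and whether the word
    // was initially POS-ambiguous (|SP_0| > 1).
    public record EvaluatedWord(Set<String> spNew, Set<String> spEtalon, boolean initiallyAmbiguous) {
        boolean matchesEtalon() {
            for (String tag : spNew) {
                if (spEtalon.contains(tag)) {
                    return true; // SP_new and SP_etalon intersect
                }
            }
            return false;
        }
    }

    // Formula (1): P = W_semiAmbig / W_Ambig, over initially ambiguous words only.
    public static double precision(List<EvaluatedWord> words) {
        long ambiguous = words.stream().filter(EvaluatedWord::initiallyAmbiguous).count();
        long semiAmbiguous = words.stream()
                .filter(w -> w.initiallyAmbiguous() && w.matchesEtalon())
                .count();
        return ambiguous == 0 ? 0.0 : (double) semiAmbiguous / ambiguous;
    }

    // Formula (2): Acc = W_correct / W_total, over all words.
    public static double accuracy(List<EvaluatedWord> words) {
        long correct = words.stream().filter(EvaluatedWord::matchesEtalon).count();
        return words.isEmpty() ? 0.0 : (double) correct / words.size();
    }
}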
Table 1. Results

Corpus        Total size   Homonymous words   Precision   Accuracy   Extra tags removed
OpenCorpora   33566        5880               96.11%      99.02%     93.04%
RusCorpora    169908       61137              86.39%      93.54%     89.22%

Such a considerable divergence in the results can be explained by the fact that these corpora use different tag sets and, moreover, different approaches to the determination of parts of speech.

For comparison, the trigram model described in paper [1] reaches a precision of 97.18% in the part-of-speech tagging task. The M-WANN-Tagger, presented in paper [7], copes with the same task for Russian with a precision of 97.01%. The authors of [5] achieved a precision of 96.56% in part-of-speech tagging using heterogeneous neural networks.

4 Conclusion

The proposed approach to part-of-speech disambiguation for Russian combines machine learning and expert systems. The results revealed that the algorithm copes well with notional parts of speech and is less effective with functional ones; this is one of the priority directions for our future investigation. We are also going to extend the algorithm to multi-word expressions and unknown words and to take punctuation into account. The Java implementation of the proposed algorithm is available on our web-site (http://176.9.34.20:8080/com.onpositive.text.webview/parsing/omonimy).

References

1. Sokirko, A., Toldova, S.: The comparison of two methods of lexical and morphological disambiguation for Russian. Internet-Mathematics 2005 (2005)
2. Brill, E.: A Simple Rule-Based Part of Speech Tagger. In: Proceedings of ANLC'92, 152-155 (1992)
3. Manning, C.D.: Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In: Computational Linguistics and Intelligent Text Processing, 12th International Conference, Part I. Lecture Notes in Computer Science 6608, 171-189 (2011)
4. Santos, C., Zadrozny, B.: Learning character-level representations for part-of-speech tagging. In: Proceedings of the 31st International Conference on Machine Learning, JMLR: W&CP (2014)
5. Malanin, G.: Part-of-Speech tagging with heterogeneous neural networks and a priori information. Youth Scientific-Technical Journal (12) (2014)
6. Antonova, A., Solov'ev, A.: Conditional Random Fields in NLP-related Tasks for Russian. Information Technologies and Systems (2013)
7. Carneiro, H., França, F., Lima, P.: Multilingual part-of-speech tagging with weightless neural networks. Neural Networks (16), 11-21 (2015)