<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Baselines and Symbol N-Grams: Simple Part-Of-Speech Tagging of Russian?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nikolay V. Arefyev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavel A. Ermolaev</string-name>
          <email>ermolaev.p.a@yandex.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lomonosov Moscow State University</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We propose using NB-SVM over a bag-of-character-n-grams input representation for determining part-of-speech tags and grammatical categories (gender, number, etc.) for words in Russian texts. Several methods are compared, including CRF (Conditional Random Fields), SVM (Support Vector Machines) and NB-SVM (Naive Bayes SVM), and the superiority of NB-SVM over the other classifiers is shown. The proposed model placed 5th among 13 models in the first shared task of the MorphoRuEval-17 challenge. We also experimented with category grouping, where a single classifier is used to determine several grammatical categories, and showed that it improves the model performance even further.</p>
      </abstract>
      <kwd-group>
        <kwd>part-of-speech tagging</kwd>
        <kwd>NB-SVM</kwd>
        <kwd>CRF</kwd>
<kwd>multi-output classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>The MorphoRuEval-17 (Sorokin, 2017) challenge consists of two shared tasks. The first one is to determine the part of speech and a number of grammatical categories (case, gender, number, etc.) for each word in a sentence. The second one is lemmatization.</p>
<p>There are various systems for the Russian language, such as Pymorphy, which determine grammatical categories and solve the lemmatization problem, but unfortunately they have no built-in means of resolving morphological ambiguity. As a result, every possible analysis is returned for each word, which makes the practical application of such analyzers difficult.</p>
<p>The purpose of this article is to compare different approaches to resolving such ambiguities, which can help select the correct hypothesis among those returned by Pymorphy-like systems. We therefore focus on the first task of MorphoRuEval-17 and do not perform lemmatization.</p>
      <p>
Part-of-speech tagging is an instance of sequence labeling, one of the fundamental tasks in Natural Language Processing. (Our title refers to the paper Baselines and Bigrams: Simple, Good Sentiment and Topic Classification by Sida Wang and Christopher D. Manning, which had a great influence on this work.) The difficulty of part-of-speech tagging lies in ambiguous and out-of-vocabulary words, which require using context information as well as internal word structure. The most popular methods of sequence labeling are Hidden Markov Models
        <xref ref-type="bibr" rid="ref1">(Brants, 2000)</xref>
        and
Conditional Random Fields
        <xref ref-type="bibr" rid="ref2">(La erty, 2001)</xref>
. Until recently, implementing a good part-of-speech tagger required a lot of manual feature engineering; recent successes in deep neural networks, however, have made it possible to learn the necessary features during model training. First convolutional neural networks
        <xref ref-type="bibr" rid="ref7">(dos Santos, 2014)</xref>
        and later
recurrent neural networks
        <xref ref-type="bibr" rid="ref5">(Plank, 2016)</xref>
showed state-of-the-art or near-state-of-the-art results in part-of-speech tagging, learning from raw text and using rather simple architectures compared to the previous best systems. Unfortunately, neural networks require much more training data and computational resources to perform well. For instance, we tried convolutional neural networks similar to the ones in
        <xref ref-type="bibr" rid="ref7">(dos Santos, 2014)</xref>
for the MorphoRuEval-17 tasks but experienced an order of magnitude longer training times than linear models (several hours instead of 5-10 minutes) and were not able to make them outperform linear models before the challenge deadline.
      </p>
<p>Grammatical information such as number for nouns or tense for verbs is often added to part-of-speech tags, increasing the number of classes (for instance, plural and singular nouns usually become different classes). This reduces the task to standard multi-class classification, where each example belongs to exactly one class. However, for morphologically rich languages this approach leads to a very large number of classes, with very few examples for some of them. We instead treat the task as an instance of multi-output classification, i.e. for each grammatical category we train a separate multi-class classifier.</p>
<p>In section 2 of the paper we compare several approaches to part-of-speech tagging on the MorphoRuEval-17 dataset. All our classifiers were trained from scratch, i.e. we used only the training set provided by the MorphoRuEval-17 challenge and did not use any dictionaries (including the provided one), unlabeled datasets or other a priori knowledge. In section 3 we propose an extension that determines grammatical categories (case, gender, number, etc.) in addition to part-of-speech tags, which is necessary to solve the first task of MorphoRuEval-17. We describe several tricks to obtain better classification results. The code to reproduce our best results is publicly available at https://github.com/nvanva/MorphoBabushka. The main contributions of this paper are the following:
1. We proposed using the NB-SVM model with a bag-of-character-n-grams input representation for POS-tagging and showed its superiority over linear SVM in this scenario.
2. We introduced a scikit-learn compatible NB-SVM implementation for easier adoption by the NLP community.
3. We showed that it is beneficial to use a single classifier to jointly determine several grammatical categories (for instance, number and case).</p>
    </sec>
    <sec id="sec-2">
      <title>Part-of-speech tagging</title>
<p>For part-of-speech tagging we tried two approaches: window-based and sentence-based classification. A window-based classifier treats each window (a target token to be classified together with a fixed number of nearby tokens) as a separate example belonging to a single class (the part-of-speech tag of the target token). A sentence-based classifier receives the whole sentence and returns a sequence of classes (the part-of-speech tag for each token in the sentence). In theory, a sentence-based classifier working from left to right can benefit from knowing the part-of-speech tags of previous tokens in the sentence while classifying the next token.</p>
      <sec id="sec-2-1">
<title>Window-based classification: NB-SVM classifier</title>
<p>We experimented with windows of sizes 1 (classification is based on the target token only, no context is used), 3 (an example consists of the target token, one token to the left and one to the right), 5, 7 and 11. We did not observe significant improvements for windows larger than 5 tokens.</p>
<p>Each token inside a window was lowercased and vectorized using a bag-of-character-n-grams representation. To distinguish prefixes and suffixes from character n-grams occurring inside a token, special symbols (^ and $) were added to each token as the first and the last character. We also added features indicating token capitalization (lowercase, uppercase, etc.). Finally, we concatenated the vectors for each token in the window to obtain the vector representation of the window passed to the classifier.</p>
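<p>A minimal sketch of this featurization using scikit-learn's character analyzer; the helper names and the toy sentence are ours, not the authors' code.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import hstack

def pad(token):
    # "^" and "$" mark token boundaries so prefix/suffix n-grams differ
    # from the same n-grams occurring inside a token.
    return "^" + token.lower() + "$"

def window_columns(sentences, window):
    # For every target token, collect the padded token at each relative
    # offset; column j holds the tokens at offset j - window // 2.
    half = window // 2
    cols = [[] for _ in range(window)]
    for sent in sentences:
        padded = ["<pad>"] * half + [pad(t) for t in sent] + ["<pad>"] * half
        for i in range(half, len(padded) - half):
            for j in range(window):
                cols[j].append(padded[i - half + j])
    return cols

sentences = [["мама", "мыла", "раму"]]
cols = window_columns(sentences, window=5)
# One character-n-gram vectorizer per window position, n-grams of length 1..5.
vecs = [CountVectorizer(analyzer="char", ngram_range=(1, 5)) for _ in range(5)]
# Concatenate the per-position vectors into a single window representation.
X = hstack([v.fit_transform(c) for v, c in zip(vecs, cols)])
```

<p>Using a separate vectorizer per window position keeps the same n-gram as a distinct feature depending on where in the window it occurs.</p>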
<p>Several window-based classifiers including Logistic Regression, Multinomial Naive Bayes and Multilayer Perceptron were tried; however, the best results were obtained with our implementation of the NB-SVM classifier, which we describe in detail below.</p>
        <p>
The NB-SVM classifier was introduced for sentiment analysis and topic categorization in (Wang, 2012). Later, with several variations, it showed excellent performance on the IMDB movie reviews dataset, exceeding all other single models including Recurrent Neural Networks and losing only to ensembles that include NB-SVM as one of their classifiers
          <xref ref-type="bibr" rid="ref3">(Mesnil, 2015)</xref>
          . We have implemented the NB-SVM classifier on top of the scikit-learn library
          <xref ref-type="bibr" rid="ref6">(Pedregosa, 2011)</xref>
          to use all the advantages of this library, including simple hyperparameter selection. We also extended the original NB-SVM to allow different scaling schemes for the train and test sets.
        </p>
<p>The main idea of NB-SVM is to scale input vectors before feeding them to the SVM classifier, using feature-specific weights so that features specific to one of the classes obtain larger values and features occurring uniformly across classes obtain smaller ones. Each feature value f_i is multiplied by r_i = log((p_i/‖p‖₁)/(n_i/‖n‖₁)), where p_i = α + Σ_k I{y^(k) = +1} f_i^(k) and n_i = α + Σ_k I{y^(k) = −1} f_i^(k) are the sums of the i-th feature values across positive and negative examples respectively. The sums are smoothed by adding a small constant α to eliminate zero denominators. These weights are essentially the feature weights learnt by a Multinomial Naive Bayes model (MNB), hence the name NB-SVM.</p>
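<p>A small numeric sketch of the log-count ratios on toy data (the variable names are ours; α is the smoothing constant mentioned above):</p>

```python
import numpy as np

# Toy feature matrix f^(k) (rows are examples) and binary labels y^(k).
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
y = np.array([1, -1, 1])
alpha = 1.0  # smoothing constant

p = alpha + X[y == 1].sum(axis=0)    # p_i: smoothed sums over positive examples
n = alpha + X[y == -1].sum(axis=0)   # n_i: smoothed sums over negative examples
r = np.log((p / p.sum()) / (n / n.sum()))  # per-feature log-count ratios r_i
X_scaled = X * r  # the vectors actually fed to the linear SVM
```

<p>Features frequent in the positive class get r_i &gt; 0 and are amplified; features frequent in the negative class get r_i &lt; 0.</p>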
<p>Our implementation first trains MNB on the training set, then uses the learned weights to rescale the training set and trains a linear SVM. We also added the possibility to binarize features or scale them to the [0,1] interval before rescaling by the MNB weights; these transformations can be applied to the training set, the test set, or both. We have found that no single transformation is optimal in all cases, and the best results are achieved by selecting the optimal transformation like any other hyperparameter.</p>
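<p>The described pipeline could be assembled from scikit-learn pieces roughly as follows; this is a hypothetical sketch in the spirit of our implementation, not its exact code, and it covers only the binary case with feature binarization.</p>

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

class NBSVM:
    """Train MultinomialNB, use its learned log-probabilities as per-feature
    scaling weights, then train a linear SVM on the rescaled inputs."""

    def __init__(self, C=1.0, binarize=True):
        self.C, self.binarize = C, binarize

    def _transform(self, X):
        return (X > 0).astype(float) if self.binarize else X

    def fit(self, X, y):
        Xt = self._transform(X)
        nb = MultinomialNB().fit(Xt, y)
        # For two classes, the NB log-count ratio is the difference of the
        # per-class feature log-probabilities learnt by MNB.
        self.r_ = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
        self.svm_ = LinearSVC(C=self.C).fit(Xt * self.r_, y)
        return self

    def predict(self, X):
        return self.svm_.predict(self._transform(X) * self.r_)

# Toy data: two classes with disjoint active features.
X = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 1, 0]], dtype=float)
y = np.array([1, 0, 1, 0])
clf = NBSVM().fit(X, y)
```

<p>Selecting `binarize` and `C` by cross-validation mirrors the hyperparameter search described above.</p>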
      </sec>
      <sec id="sec-2-2">
<title>Sentence-based classification: CRF</title>
<p>An alternative to window-based classification is sentence-based classification, where a classifier accepts a whole sentence as a single example and returns a class for each token in the sentence. A sentence-based classifier can benefit from learning dependencies between classes of nearby tokens (for instance, an adjective is more likely to be followed by a noun than by a verb) and from using the classes of previous tokens to classify the next one.</p>
      <p>
          For sentence-based classification we trained a Conditional Random Fields (CRF) model
          <xref ref-type="bibr" rid="ref2">(Lafferty, 2001)</xref>
          . For CRF we used the same features as for NB-SVM. We implemented the CRF classifier using sklearn-crfsuite (http://sklearn-crfsuite.readthedocs.io/), a thin CRFsuite
          <xref ref-type="bibr" rid="ref4">(Okazaki N., 2007)</xref>
          wrapper which provides an interface similar to scikit-learn.
      </p>
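<p>Per-token features for CRFsuite are plain dicts; a sketch of the character-n-gram and capitalization features described above, with a helper name of our own choosing:</p>

```python
def token_features(sent, i, max_n=5):
    # Feature dict for one token, in the style sklearn-crfsuite expects:
    # capitalization flags plus all character n-grams of the padded token.
    token = sent[i]
    padded = "^" + token.lower() + "$"
    feats = {"is_upper": token.isupper(), "is_title": token.istitle()}
    for n in range(1, max_n + 1):
        for j in range(len(padded) - n + 1):
            feats["ngram=" + padded[j:j + n]] = 1
    return feats

sent = ["Мама", "мыла", "раму"]
# One sentence is one sequence of per-token feature dicts.
X = [[token_features(sent, i) for i in range(len(sent))]]
```

<p>Such sequences of feature dicts can then be passed to sklearn-crfsuite's `fit` together with the corresponding tag sequences.</p>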
      </sec>
      <sec id="sec-2-3">
        <title>Memory baseline</title>
<p>The simplest model we used as a baseline memorizes the classes assigned to each token in the training set and returns the most frequent class of the given token. If the token did not occur in the training set, the most frequent class overall is returned (NOUN in our case). We want to stress that this technique for dealing with out-of-vocabulary words is used in the memory baseline only. The other approaches we tried are based on character n-grams, not words, so they do not suffer from this problem. The only preprocessing we did was lowercasing, which improved performance a bit.</p>
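<p>The memory baseline fits in a few lines; function names here are ours:</p>

```python
from collections import Counter, defaultdict

def train_memory_baseline(tokens, tags):
    # Remember the most frequent tag of each lowercased training token and
    # the most frequent tag overall as a fallback for unseen tokens.
    per_token = defaultdict(Counter)
    for tok, tag in zip(tokens, tags):
        per_token[tok.lower()][tag] += 1
    fallback = Counter(tags).most_common(1)[0][0]
    model = {t: c.most_common(1)[0][0] for t, c in per_token.items()}
    return model, fallback

def tag(model, fallback, tokens):
    return [model.get(t.lower(), fallback) for t in tokens]

model, fallback = train_memory_baseline(
    ["мама", "мыла", "раму"], ["NOUN", "VERB", "NOUN"])
```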
      </sec>
      <sec id="sec-2-4">
        <title>Experiments</title>
<p>Since there was no separate shared task for part-of-speech tagging in MorphoRuEval-17, we report the results of our own evaluation here. We used the official train/test split of the Gikrya dataset and did not use additional datasets or any other resources. The models were trained and evaluated on all 13 parts of speech occurring in the training set instead of only the 7 officially evaluated in MorphoRuEval-17 (results are much better when measured on only 7 parts of speech, but this is quite a non-standard evaluation scheme). We report accuracy on the test set, i.e. the proportion of correctly classified tokens. For each classifier we selected the best</p>
        <sec id="sec-2-4-1">
<p>regularization, and for NB-SVM we also selected the best scaling scheme, using 3-fold cross-validation on the training set.</p>
<p>Table 1 shows the results of NB-SVM compared to CRF, linear SVM over the pure bag-of-character-n-grams representation, linear SVM with the more traditional tf-idf scaling, and the memory baseline. A window of size 5 and n-grams of sizes 1 to 5 were used for all classifiers. NB-SVM is the best model for POS-tagging, improving results by 0.2% (10% error reduction) compared to linear SVM with no scaling. Tf-idf scaling does not improve accuracy, while the MNB scaling in NB-SVM helps, probably because the latter takes dependencies between classes and features into account. Regarding feature importance, we can see that padding tokens to distinguish the same character n-gram occurring as a prefix, a suffix or inside the token helps, but the capitalization features do not.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Multi-output extension of part-of-speech tagging</title>
<p>To solve the first shared task of MorphoRuEval-17, not only parts of speech but also grammatical categories (case, number, tense, etc.) were required. The simplest solution, often used for morphologically not-so-rich languages, is to treat each possible combination of part of speech and grammatical attributes as a separate class; however, this showed very poor accuracy in our preliminary experiments, since the number of possible classes becomes very large and most of them have very few examples.</p>
<p>We treated the task as multi-output classification, where a classifier has a fixed number of outputs, each indicating a class of the input example according to its own criterion (grammatical category). The classes for each output are disjoint (for instance, a token cannot be both a noun and a verb, or be in both nominative and dative case). This disjointness of classes for each output distinguishes multi-output classification from multi-label classification, where all combinations of classes are possible.</p>
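<p>Multi-output classification as used here amounts to one independent multi-class classifier per category. A toy sketch with made-up 2-dimensional features (the paper uses character-n-gram vectors):</p>

```python
import numpy as np
from sklearn.svm import LinearSVC

class MultiOutputTagger:
    """One independent multi-class linear classifier per grammatical category."""

    def fit(self, X, labels_by_category):
        # labels_by_category maps a category name to one label per example,
        # e.g. {"Pos": [...], "Number": [...]}.
        self.clfs_ = {cat: LinearSVC().fit(X, y)
                      for cat, y in labels_by_category.items()}
        return self

    def predict(self, X):
        return {cat: clf.predict(X) for cat, clf in self.clfs_.items()}

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.2], [0.1, 1.0]])
y = {"Pos": ["NOUN", "VERB", "NOUN", "VERB"],
     "Number": ["Sing", "Plur", "Sing", "Plur"]}
tagger = MultiOutputTagger().fit(X, y)
```

<p>Grouping categories, as tried below, amounts to joining two label columns into one (e.g. "NOUN|Sing") and fitting a single classifier on the joined labels.</p>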
      <sec id="sec-3-1">
        <title>Category grouping</title>
<p>One of the tricks we tried is using a single output for several grammatical categories. For instance, we can group number and gender and have a single output with 6 classes instead of two different outputs with 2 and 3 classes. In theory, this can help if some classes are not linearly separable (remember that we use linear classifiers).</p>
        <p>To evaluate the performance of our models in the first shared task we used the official MorphoRuEval-17 training/development split (https://github.com/dialogue-evaluation/morphoRuEval-2017) and the official evaluation script. For all of our models we report per-token accuracy on the development set, measured by us using the official script. Additionally, for those of our models that were submitted to the challenge, we report per-token and per-sentence accuracies averaged over three test sets, measured by the challenge organizers and used for the final ranking of participants. It is worth mentioning that, unlike our POS-tagging evaluation in section 2.4, the official script checks classification correctness not for all tokens but only for those belonging to several parts of speech, and not for all grammatical categories but only for several of them, depending on the part of speech.</p>
<p>The initial results that we submitted are shown in Table 3. Dev accuracy was measured by us on the official development set; test accuracy shows the official results on the test set as measured by the challenge committee, and is reported only for those models that were submitted to the challenge. The features are the same as for POS-tagging, and all categories were classified separately. As for POS-tagging, the best results were achieved by the NB-SVM classifier. Its per-token accuracy was the 5th best among the participating models, and its per-sentence accuracy the 7th.</p>
<p>Next we tried to improve the accuracy of the best model by grouping several categories and using a single classifier for them. In our experiments with category grouping we decided to try only the combinations presented in Table 2, since evaluating all possible combinations would take too long.</p>
<p>For each group (including groups consisting of a single category) optimal hyperparameters were chosen separately using 3-fold cross-validation on the training set, which ensures the best possible classifier performance (this is why we obtained better results even without category grouping compared to Table 3). As we can see, the most successful scheme uses a single classifier for Number and</p>
        <sec id="sec-3-1-1">
<p>Case and separate classifiers for all other categories. The correct grouping adds +0.6% to the accuracy.</p>
          <p>For error analysis we trained a separate NB-SVM classifier for each of the 10 grammatical categories and used all of their values, not only the officially evaluated ones. We used a window of size 5, n-grams of sizes 1 to 5, and selected the best regularization and train/test scaling schemes individually for each classifier using 3-fold cross-validation on the training set. We then analyzed the classifiers' performance on the official development set.</p>
<p>Table 4 shows the performance of NB-SVM for different grammatical categories. In addition to the accuracy and error rate for each category, we report the support (the number of tokens in the development set used for evaluation) and the error count (the number of tokens misclassified by the corresponding classifier). It should be emphasized that the error counts in Table 4 may not sum to the total number of misclassified tokens (for instance, the same token can have both case and gender misclassified).</p>
<p>A classifier returning some value for a grammatical category on tokens which do not have this category in the gold standard was not considered an error by the official MorphoRuEval-17 evaluation script. For instance, returning a gender tag for plural adjectives or a case tag for verbs was not penalized. Hence, during evaluation for each category we ignored tokens which do not have this category in the gold standard, which explains the different support for each category. This also means that the error count, rather than the accuracy, of individual classifiers affects the overall performance most. For instance, the accuracy for the Pos category is higher than for the Gender category, but its support and error count are also higher, so it can be more effective to improve the Pos classifier first.</p>
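<p>The per-category support and error counting described above can be sketched as follows; the dict-based token representation is our illustration, not the official script's format:</p>

```python
def category_errors(gold, pred, category):
    # Mirror the official rule sketched above: a token counts toward a
    # category's support (and possibly its errors) only if the gold
    # standard annotates that category for the token.
    support = errors = 0
    for g, p in zip(gold, pred):  # g, p are dicts like {"Case": "Nom"}
        if category in g:
            support += 1
            if p.get(category) != g[category]:
                errors += 1
    return support, errors

gold = [{"Case": "Nom"}, {"Case": "Acc"}, {}]          # third token has no case
pred = [{"Case": "Nom"}, {"Case": "Nom"}, {"Case": "Dat"}]
```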
<p>Table 4 shows that most errors are introduced by the Pos and Case classifiers, and the Case classifier is responsible for roughly half of the errors. The misclassification matrix for the Case category is presented in Fig. 2. For the Case classifier, more than half of the errors come from misclassifying the accusative case as nominative and vice versa. The errors of the Pos classifier are more diverse; the most common one is misclassifying particles as conjunctions (14% of the errors).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and future work</title>
<p>Part-of-speech tagging is a well-developed task, but there is still room for improvement. As far as we know, we are the first to propose using NB-SVM with a character n-grams representation for POS-tagging; we showed that it outperforms other linear classifiers in the first shared task of the MorphoRuEval-17 challenge and, moreover, was among the 5 best models. For this task we also showed that it can be advantageous to use a single classifier to jointly determine several grammatical categories instead of a separate classifier for each category. Error analysis showed that the most promising direction is to improve the Case classifier, for example by increasing its prediction accuracy for the nominative and accusative cases.</p>
<p>For future work it will be interesting to try Recurrent Neural Networks, which showed state-of-the-art results for POS-tagging of English and several other languages. Using large unlabeled corpora for unsupervised pretraining is also a very promising technique, because it can significantly improve classification of rare and out-of-vocabulary words.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Brants</surname>
            <given-names>T.</given-names>
          </string-name>
          , (
          <year>2000</year>
          ),
<article-title>TnT - a statistical part-of-speech tagger</article-title>
          .
          <source>In Proceedings of the 6th Applied NLP Conference, ANLP-2000, April 29 - May 3</source>
          ,
          <year>2000</year>
          , Seattle, WA.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
<surname>Lafferty</surname>
<given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            <given-names>F.</given-names>
          </string-name>
          , (
          <year>2001</year>
          ),
<article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>In Proceedings of the Eighteenth International Conference on Machine Learning</source>
, Williamstown, USA, pp.
<fpage>282</fpage>
-
<lpage>289</lpage>
.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Mesnil</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , (
          <year>2015</year>
          ),
          <article-title>Ensemble of Generative and Discriminative Techniques for Sentiment Analysis of Movie Reviews</article-title>
          . Submitted to the
          <source>workshop track of ICLR</source>
          <year>2015</year>
          , available at: https://arxiv.org/abs/1412.5335.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Okazaki N.</surname>
          </string-name>
          , (
          <year>2007</year>
          ),
<article-title>CRFsuite: a fast implementation of Conditional Random Fields (CRFs)</article-title>
, available at: http://www.chokkan.org/software/crfsuite/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Plank</surname>
<given-names>B.</given-names>
</string-name>
,
<string-name>
<surname>Søgaard</surname>
<given-names>A.</given-names>
</string-name>
,
          <string-name>
            <surname>Goldberg</surname>
            <given-names>Y.</given-names>
          </string-name>
          , (
          <year>2016</year>
          ),
          <article-title>Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss</article-title>
          .
          <source>In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          , Berlin, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
<string-name><surname>Pedregosa</surname> <given-names>F.</given-names></string-name>, <string-name><surname>Varoquaux</surname> <given-names>G.</given-names></string-name>, <string-name><surname>Gramfort</surname> <given-names>A.</given-names></string-name>, <string-name><surname>Michel</surname> <given-names>V.</given-names></string-name>, <string-name><surname>Thirion</surname> <given-names>B.</given-names></string-name>, <string-name><surname>Grisel</surname> <given-names>O.</given-names></string-name>, <string-name><surname>Blondel</surname> <given-names>M.</given-names></string-name>, <string-name><surname>Prettenhofer</surname> <given-names>P.</given-names></string-name>, <string-name><surname>Weiss</surname> <given-names>R.</given-names></string-name>, <string-name><surname>Dubourg</surname> <given-names>V.</given-names></string-name>, <string-name><surname>Vanderplas</surname> <given-names>J.</given-names></string-name>, <string-name><surname>Passos</surname> <given-names>A.</given-names></string-name>, <string-name><surname>Cournapeau</surname> <given-names>D.</given-names></string-name>, (<year>2011</year>), <article-title>"Scikit-learn: Machine Learning in Python"</article-title>. <source>Journal of Machine Learning Research</source>, <volume>12</volume>: pp. <fpage>2825</fpage>-<lpage>2830</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
<string-name><surname>dos Santos</surname> <given-names>C.</given-names></string-name>, <string-name><surname>Zadrozny</surname> <given-names>B.</given-names></string-name>
          , (
          <year>2014</year>
          ),
<article-title>Learning character-level representations for part-of-speech tagging</article-title>
          .
<source>In Proceedings of the 31st International Conference on Machine Learning (ICML)</source>
          , Beijing, China,
          <year>2014</year>
          . JMLR: WCP vol.
          <volume>32</volume>
          ., pp.
          <fpage>1818</fpage>
          -
          <lpage>1826</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><surname>Sorokin</surname> <given-names>A.</given-names></string-name>, <string-name><surname>Shavrina</surname> <given-names>T.</given-names></string-name>, <string-name><surname>Lyashevskaya</surname> <given-names>O.</given-names></string-name>, <string-name><surname>Bocharov</surname> <given-names>V.</given-names></string-name>, <string-name><surname>Alexeeva</surname> <given-names>S.</given-names></string-name>, <string-name><surname>Droganova</surname> <given-names>K.</given-names></string-name>, <string-name><surname>Fenogenova</surname> <given-names>A.</given-names></string-name>, (<year>2017</year>), <article-title>MorphoRuEval-2017: an evaluation track for the automatic morphological analysis methods for Russian</article-title>. <source>In Computational Linguistics and Intellectual Technologies. Proceedings of the International Workshop Dialogue'2017</source>, Moscow.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name><surname>Wang</surname> <given-names>S.</given-names></string-name>, <string-name><surname>Manning</surname> <given-names>C.</given-names></string-name>, (<year>2012</year>), <article-title>Baselines and Bigrams: Simple, Good Sentiment and Topic Classification</article-title>. <source>Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics</source>, Jeju, Republic of Korea, pp. <fpage>90</fpage>-<lpage>94</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>