<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Baselines and Symbol N-Grams: Simple Part-Of-Speech Tagging of Russian?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nikolay V. Arefyev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavel A. Ermolaev</string-name>
          <email>ermolaev.p.a@yandex.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lomonosov Moscow State University</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We propose using NB-SVM over a bag-of-character-n-grams input representation for determining part-of-speech tags and grammatical categories (gender, number, etc.) for words in Russian texts. Several methods are compared, including CRF (Conditional Random Fields), SVM (Support Vector Machines) and NB-SVM (Naive Bayes SVM), and the superiority of NB-SVM over the other classifiers is shown. The proposed model placed 5th among 13 models in the first shared task of the MorphoRuEval-17 challenge. We also experimented with category grouping, where a single classifier is used to determine several grammatical categories, and showed that it improves the model performance even further.</p>
      </abstract>
      <kwd-group>
        <kwd>part-of-speech tagging</kwd>
        <kwd>NB-SVM</kwd>
        <kwd>CRF</kwd>
<kwd>multi-output classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>The MorphoRuEval-17 (Sorokin, 2017) challenge consists of two shared tasks. The first one is to determine the part of speech and a number of grammatical categories (case, gender, number, etc.) for each word in a sentence. The second one is lemmatization.</p>
<p>There are various systems for the Russian language, such as Pymorphy, which determine grammatical categories and solve the lemmatization problem, but unfortunately they have no built-in means of resolving morphological ambiguity. As a result, every possible analysis is returned for each word, which makes the practical application of such analyzers difficult.</p>
<p>The purpose of this article is to compare different approaches to resolving such ambiguities, which can help select the correct hypothesis among those returned by Pymorphy-like systems. We therefore focus on the first task of MorphoRuEval-17 and do not perform lemmatization.</p>
      <p>
Part-of-speech tagging is an instance of sequence labeling, one of the fundamental tasks in Natural Language Processing. (Our title refers to the paper Baselines and Bigrams: Simple, Good Sentiment and Topic Classification by Sida Wang and Christopher D. Manning, which had a great influence on this work.) The difficulty of part-of-speech tagging lies in ambiguous and out-of-vocabulary words, which require using context information as well as internal word structure. The most popular methods of sequence labeling are Hidden Markov Models
        <xref ref-type="bibr" rid="ref1">(Brants, 2000)</xref>
        and
Conditional Random Fields
        <xref ref-type="bibr" rid="ref2">(La erty, 2001)</xref>
. Until recently, implementing a good part-of-speech tagger required a lot of manual feature engineering; recent successes in deep neural networks, however, have made it possible to learn the necessary features during model training. First convolutional neural networks
        <xref ref-type="bibr" rid="ref7">(dos Santos, 2014)</xref>
        and later
recurrent neural networks
        <xref ref-type="bibr" rid="ref5">(Plank, 2016)</xref>
showed state-of-the-art or near-state-of-the-art results in part-of-speech tagging, learning from raw text and using rather simple architectures compared to the previous best systems. Unfortunately, neural networks require much more training data and computational resources to perform well. For instance, we tried convolutional neural networks similar to the ones in
        <xref ref-type="bibr" rid="ref7">(dos Santos, 2014)</xref>
for the MorphoRuEval-17 tasks but experienced an order of magnitude longer training times than linear models (several hours instead of 5-10 minutes) and were not able to make them outperform linear models before the challenge deadline.
      </p>
<p>Grammatical information such as number for nouns or tense for verbs is often added to part-of-speech tags, increasing the number of classes (for instance, plural and singular nouns usually become different classes). This reduces the task to standard multi-class classification, where each example belongs to exactly one class. However, for morphologically rich languages this approach leads to a very large number of classes, with very few examples for some of them. We instead treat the task as an instance of multi-output classification, i.e. for each grammatical category we train a separate multi-class classifier.</p>
<p>In section 2 of the paper we compare several approaches to part-of-speech tagging on the MorphoRuEval-17 dataset. All our classifiers were trained from scratch, i.e. we used only the training set provided by the MorphoRuEval-17 challenge and did not use any dictionaries (including the provided one), unlabeled datasets or other a priori knowledge. In section 3 we propose an extension that determines grammatical categories (case, gender, number, etc.) in addition to part-of-speech tags, which is necessary to solve the first task of MorphoRuEval-17. We describe several tricks to obtain better classification results. The code to reproduce our best results is publicly available at https://github.com/nvanva/MorphoBabushka. The main contributions of this paper are the following:
1. We proposed using the NB-SVM model with a bag-of-character-n-grams input representation for POS-tagging and showed its superiority over linear SVM in this scenario.
2. We introduced a scikit-learn compatible NB-SVM implementation for easier adoption by the NLP community.
3. We showed that it is beneficial to use a single classifier to jointly determine several grammatical categories (for instance, number and case).</p>
    </sec>
    <sec id="sec-2">
      <title>Part-of-speech tagging</title>
<p>For part-of-speech tagging we tried two approaches: window-based and sentence-based classification. A window-based classifier treats each window (a target token to be classified together with a fixed number of nearby tokens) as a separate example belonging to a single class (the part-of-speech tag of the target token). A sentence-based classifier receives the whole sentence and returns a sequence of classes (the part-of-speech tag for each token in the sentence). In theory, a sentence-based classifier working from left to right can benefit from knowing the part-of-speech tags of previous tokens in the sentence while classifying the next token.</p>
      <sec id="sec-2-1">
<title>Window-based classification: NB-SVM classifier</title>
<p>We experimented with windows of sizes 1 (classification is based on the target token only, no context is used), 3 (an example consists of the target token, one token to the left and one to the right), 5, 7 and 11. We did not observe significant improvements for windows larger than 5 tokens.</p>
<p>Each token inside a window was lowercased and vectorized using a bag-of-character-n-grams representation. To distinguish prefixes and suffixes from character n-grams occurring inside a token, special symbols (^ and $) were added to each token as the first and the last character. We also added features indicating token capitalization (lowercase, uppercase, etc.). Finally, we concatenated the vectors for each token in the window to obtain the vector representation of the window passed to the classifier.</p>
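<p>A minimal sketch of this featurization using scikit-learn's character analyzer; the helper names and the toy sentence are ours, not the authors' code.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import hstack

def pad(token):
    # "^" and "$" mark token boundaries so prefix/suffix n-grams differ
    # from the same n-grams occurring inside a token.
    return "^" + token.lower() + "$"

def window_columns(sentences, window):
    # For every target token, collect the padded token at each relative
    # offset; column j holds the tokens at offset j - window // 2.
    half = window // 2
    cols = [[] for _ in range(window)]
    for sent in sentences:
        padded = ["<pad>"] * half + [pad(t) for t in sent] + ["<pad>"] * half
        for i in range(half, len(padded) - half):
            for j in range(window):
                cols[j].append(padded[i - half + j])
    return cols

sentences = [["мама", "мыла", "раму"]]
cols = window_columns(sentences, window=5)
# One character-n-gram vectorizer per window position, n-grams of length 1..5.
vecs = [CountVectorizer(analyzer="char", ngram_range=(1, 5)) for _ in range(5)]
# Concatenate the per-position vectors into a single window representation.
X = hstack([v.fit_transform(c) for v, c in zip(vecs, cols)])
```

<p>Using a separate vectorizer per window position keeps the same n-gram as a distinct feature depending on where in the window it occurs.</p>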
<p>Several window-based classifiers including Logistic Regression, Multinomial Naive Bayes and Multilayer Perceptron were tried; however, the best results were obtained with our implementation of the NB-SVM classifier, which we describe in detail below.</p>
        <p>
The NB-SVM classifier was introduced for sentiment analysis and topic categorization in (Wang, 2012). Later, with several variations, it showed excellent performance on the IMDB movie reviews dataset, exceeding all other single models including Recurrent Neural Networks and losing only to ensembles that include NB-SVM as one of their classifiers
          <xref ref-type="bibr" rid="ref3">(Mesnil, 2015)</xref>
          . We have implemented the NB-SVM classifier on top of the scikit-learn library
          <xref ref-type="bibr" rid="ref6">(Pedregosa, 2011)</xref>
          to use all the advantages of this library, including simple hyperparameter selection. We also extended the original NB-SVM to allow different scaling schemes for the train and test sets.
        </p>
<p>The main idea of NB-SVM is to scale input vectors before feeding them to the SVM classifier, using feature-specific weights so that features specific to one of the classes obtain larger values and features occurring uniformly across classes obtain smaller ones. Each feature value f_i is multiplied by r_i = log((p_i/‖p‖₁)/(n_i/‖n‖₁)), where p_i = α + Σ_k I{y^(k) = +1} f_i^(k) and n_i = α + Σ_k I{y^(k) = −1} f_i^(k) are the sums of the i-th feature values across positive and negative examples respectively. The sums are smoothed by adding a small constant α to eliminate zero denominators. These weights are essentially the feature weights learnt by a Multinomial Naive Bayes model (MNB), hence the name NB-SVM.</p>
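<p>A small numeric sketch of the log-count ratios on toy data (the variable names are ours; α is the smoothing constant mentioned above):</p>

```python
import numpy as np

# Toy feature matrix f^(k) (rows are examples) and binary labels y^(k).
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
y = np.array([1, -1, 1])
alpha = 1.0  # smoothing constant

p = alpha + X[y == 1].sum(axis=0)    # p_i: smoothed sums over positive examples
n = alpha + X[y == -1].sum(axis=0)   # n_i: smoothed sums over negative examples
r = np.log((p / p.sum()) / (n / n.sum()))  # per-feature log-count ratios r_i
X_scaled = X * r  # the vectors actually fed to the linear SVM
```

<p>Features frequent in the positive class get r_i &gt; 0 and are amplified; features frequent in the negative class get r_i &lt; 0.</p>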
<p>Our implementation first trains MNB on the training set, then uses the learned weights to rescale the training set and trains a linear SVM. We also added the possibility to binarize features or scale them to the [0,1] interval before rescaling by the MNB weights; these transformations can be applied to the training set, the test set, or both. We have found that no single transformation is optimal in all cases, and the best results are achieved by selecting the optimal transformation like any other hyperparameter.</p>
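<p>The described pipeline could be assembled from scikit-learn pieces roughly as follows; this is a hypothetical sketch in the spirit of our implementation, not its exact code, and it covers only the binary case with feature binarization.</p>

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

class NBSVM:
    """Train MultinomialNB, use its learned log-probabilities as per-feature
    scaling weights, then train a linear SVM on the rescaled inputs."""

    def __init__(self, C=1.0, binarize=True):
        self.C, self.binarize = C, binarize

    def _transform(self, X):
        return (X > 0).astype(float) if self.binarize else X

    def fit(self, X, y):
        Xt = self._transform(X)
        nb = MultinomialNB().fit(Xt, y)
        # For two classes, the NB log-count ratio is the difference of the
        # per-class feature log-probabilities learnt by MNB.
        self.r_ = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
        self.svm_ = LinearSVC(C=self.C).fit(Xt * self.r_, y)
        return self

    def predict(self, X):
        return self.svm_.predict(self._transform(X) * self.r_)

# Toy data: two classes with disjoint active features.
X = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 1, 0]], dtype=float)
y = np.array([1, 0, 1, 0])
clf = NBSVM().fit(X, y)
```

<p>Selecting `binarize` and `C` by cross-validation mirrors the hyperparameter search described above.</p>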
      </sec>
      <sec id="sec-2-2">
<title>Sentence-based classification: CRF</title>
<p>An alternative to window-based classification is sentence-based classification, where a classifier accepts a whole sentence as a single example and returns a class for each token in the sentence. A sentence-based classifier can benefit from learning dependencies between classes of nearby tokens (for instance, an adjective is more likely to be followed by a noun than by a verb) and from using the classes of previous tokens to classify the next one.</p>
      <p>
          For sentence-based classification we trained a Conditional Random Fields (CRF) model
          <xref ref-type="bibr" rid="ref2">(Lafferty, 2001)</xref>
          . For CRF we used the same features as for NB-SVM. We implemented the CRF classifier using sklearn-crfsuite (http://sklearn-crfsuite.readthedocs.io/), a thin CRFsuite
          <xref ref-type="bibr" rid="ref4">(Okazaki N., 2007)</xref>
          wrapper which provides an interface similar to scikit-learn.
      </p>
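<p>Per-token features for CRFsuite are plain dicts; a sketch of the character-n-gram and capitalization features described above, with a helper name of our own choosing:</p>

```python
def token_features(sent, i, max_n=5):
    # Feature dict for one token, in the style sklearn-crfsuite expects:
    # capitalization flags plus all character n-grams of the padded token.
    token = sent[i]
    padded = "^" + token.lower() + "$"
    feats = {"is_upper": token.isupper(), "is_title": token.istitle()}
    for n in range(1, max_n + 1):
        for j in range(len(padded) - n + 1):
            feats["ngram=" + padded[j:j + n]] = 1
    return feats

sent = ["Мама", "мыла", "раму"]
# One sentence is one sequence of per-token feature dicts.
X = [[token_features(sent, i) for i in range(len(sent))]]
```

<p>Such sequences of feature dicts can then be passed to sklearn-crfsuite's `fit` together with the corresponding tag sequences.</p>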
      </sec>
      <sec id="sec-2-3">
        <title>Memory baseline</title>
<p>The simplest model we used as a baseline memorizes the classes assigned to each token in the training set and returns the most frequent class of the given token. If the token did not occur in the training set, the most frequent class overall is returned (NOUN in our case). We want to stress that this technique for dealing with out-of-vocabulary words is used in the memory baseline only. The other approaches we tried are based on character n-grams, not words, so they do not suffer from this problem. The only preprocessing we did was lowercasing, which improved performance a bit.</p>
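<p>The memory baseline fits in a few lines; function names here are ours:</p>

```python
from collections import Counter, defaultdict

def train_memory_baseline(tokens, tags):
    # Remember the most frequent tag of each lowercased training token and
    # the most frequent tag overall as a fallback for unseen tokens.
    per_token = defaultdict(Counter)
    for tok, tag in zip(tokens, tags):
        per_token[tok.lower()][tag] += 1
    fallback = Counter(tags).most_common(1)[0][0]
    model = {t: c.most_common(1)[0][0] for t, c in per_token.items()}
    return model, fallback

def tag(model, fallback, tokens):
    return [model.get(t.lower(), fallback) for t in tokens]

model, fallback = train_memory_baseline(
    ["мама", "мыла", "раму"], ["NOUN", "VERB", "NOUN"])
```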
      </sec>
      <sec id="sec-2-4">
        <title>Experiments</title>
<p>Since there was no separate shared task for part-of-speech tagging in MorphoRuEval-17, we report the results of our own evaluation here. We used the official train/test split of the Gikrya dataset and did not use additional datasets or any other resources. The models were trained and evaluated on all 13 parts of speech occurring in the training set instead of only the 7 officially evaluated in MorphoRuEval-17 (results are much better when measured on only 7 parts of speech, but this is quite a non-standard evaluation scheme). We report accuracy on the test set, i.e. the proportion of correctly classified tokens. For each classifier we selected the best</p>
        <sec id="sec-2-4-1">
<p>regularization, and for NB-SVM we also selected the best scaling scheme, using 3-fold cross-validation on the training set.</p>
<p>Table 1 shows the results of NB-SVM compared to CRF, linear SVM over the pure bag-of-character-n-grams representation, linear SVM with the more traditional tf-idf scaling, and the memory baseline. A window of size 5 and n-grams of sizes 1 to 5 were used for all classifiers. NB-SVM is the best model for POS-tagging, improving results by 0.2% (10% error reduction) compared to linear SVM with no scaling. Tf-idf scaling does not improve accuracy, while the MNB scaling in NB-SVM helps, probably because the latter takes dependencies between classes and features into account. Regarding feature importance, we can see that padding tokens to distinguish the same character n-gram occurring as a prefix, a suffix or inside the token helps, but the capitalization features do not.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Multi-output extension of part-of-speech tagging</title>
<p>To solve the first shared task of MorphoRuEval-17, not only parts of speech but also grammatical categories (case, number, tense, etc.) were required. The simplest solution, often used for morphologically not-so-rich languages, is to treat each possible combination of part of speech and grammatical attributes as a separate class; however, this showed very poor accuracy in our preliminary experiments, since the number of possible classes becomes very large and most of them have very few examples.</p>
<p>We treated the task as multi-output classification, where a classifier has a fixed number of outputs, each indicating a class of the input example according to its own criterion (grammatical category). The classes for each output are disjoint (for instance, a token cannot be both a noun and a verb, or be in both nominative and dative case). This disjointness of classes for each output distinguishes multi-output classification from multi-label classification, where all combinations of classes are possible.</p>
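<p>Multi-output classification as used here amounts to one independent multi-class classifier per category. A toy sketch with made-up 2-dimensional features (the paper uses character-n-gram vectors):</p>

```python
import numpy as np
from sklearn.svm import LinearSVC

class MultiOutputTagger:
    """One independent multi-class linear classifier per grammatical category."""

    def fit(self, X, labels_by_category):
        # labels_by_category maps a category name to one label per example,
        # e.g. {"Pos": [...], "Number": [...]}.
        self.clfs_ = {cat: LinearSVC().fit(X, y)
                      for cat, y in labels_by_category.items()}
        return self

    def predict(self, X):
        return {cat: clf.predict(X) for cat, clf in self.clfs_.items()}

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.2], [0.1, 1.0]])
y = {"Pos": ["NOUN", "VERB", "NOUN", "VERB"],
     "Number": ["Sing", "Plur", "Sing", "Plur"]}
tagger = MultiOutputTagger().fit(X, y)
```

<p>Grouping categories, as tried below, amounts to joining two label columns into one (e.g. "NOUN|Sing") and fitting a single classifier on the joined labels.</p>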
      <sec id="sec-3-1">
        <title>Category grouping</title>
<p>One of the tricks we tried is using a single output for several grammatical categories. For instance, we can group number and gender and have a single output with 6 classes instead of two different outputs with 2 and 3 classes. In theory, this can help if some classes are not linearly separable (remember that we use linear classifiers).</p>
        <p>To evaluate the performance of our models in the first shared task we used the official MorphoRuEval-17 training/development split (https://github.com/dialogue-evaluation/morphoRuEval-2017) and the official evaluation script. For all of our models we report per-token accuracy on the development set, measured by us using the official script. Additionally, for those of our models that were submitted to the challenge, we report per-token and per-sentence accuracies averaged over three test sets, measured by the challenge organizers and used for the final ranking of participants. It is worth mentioning that, unlike our POS-tagging evaluation in section 2.4, the official script checks classification correctness not for all tokens but only for those belonging to several parts of speech, and not for all grammatical categories but only for several of them, depending on the part of speech.</p>
<p>The initial results that we submitted are shown in Table 3. Dev accuracy was measured by us on the official development set; test accuracy shows the official results on the test set as measured by the challenge committee, and is reported only for those models that were submitted to the challenge. The features are the same as for POS-tagging, and all categories were classified separately. As for POS-tagging, the best results were achieved by the NB-SVM classifier. Its per-token accuracy was the 5th best among the participating models, and its per-sentence accuracy the 7th.</p>
<p>Next we tried to improve the accuracy of the best model by grouping several categories and using a single classifier for them. In our experiments with category grouping we decided to try only the combinations presented in Table 2, since evaluating all possible combinations would take too long.</p>
<p>For each group (including groups consisting of a single category) optimal hyperparameters were chosen separately using 3-fold cross-validation on the training set, which ensures the best possible classifier performance (this is why we obtained better results even without category grouping compared to Table 3). As we can see, the most successful scheme uses a single classifier for Number and</p>
        <sec id="sec-3-1-1">
<p>Case and separate classifiers for all other categories. The correct grouping adds +0.6% to the accuracy.</p>
          <p>For error analysis we trained a separate NB-SVM classifier for each of the 10 grammatical categories and used all of their values, not only the officially evaluated ones. We used a window of size 5, n-grams of sizes 1 to 5, and selected the best regularization and train/test scaling schemes individually for each classifier using 3-fold cross-validation on the training set. We then analyzed the classifiers' performance on the official development set.</p>
<p>Table 4 shows the performance of NB-SVM for different grammatical categories. In addition to the accuracy and error rate for each category, we report the support (the number of tokens in the development set used for evaluation) and the error count (the number of tokens misclassified by the corresponding classifier). It should be emphasized that the error counts in Table 4 may not sum to the total number of misclassified tokens (for instance, the same token can have both case and gender misclassified).</p>
<p>A classifier returning some value for a grammatical category on tokens which do not have this category in the gold standard was not considered an error by the official MorphoRuEval-17 evaluation script. For instance, returning a gender tag for plural adjectives or a case tag for verbs was not penalized. Hence, during evaluation for each category we ignored tokens which do not have this category in the gold standard, which explains the different support for each category. This also means that the error count, rather than the accuracy, of individual classifiers affects the overall performance most. For instance, the accuracy for the Pos category is higher than for the Gender category, but its support and error count are also higher, so it can be more effective to improve the Pos classifier first.</p>
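<p>The per-category support and error counting described above can be sketched as follows; the dict-based token representation is our illustration, not the official script's format:</p>

```python
def category_errors(gold, pred, category):
    # Mirror the official rule sketched above: a token counts toward a
    # category's support (and possibly its errors) only if the gold
    # standard annotates that category for the token.
    support = errors = 0
    for g, p in zip(gold, pred):  # g, p are dicts like {"Case": "Nom"}
        if category in g:
            support += 1
            if p.get(category) != g[category]:
                errors += 1
    return support, errors

gold = [{"Case": "Nom"}, {"Case": "Acc"}, {}]          # third token has no case
pred = [{"Case": "Nom"}, {"Case": "Nom"}, {"Case": "Dat"}]
```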
<p>Table 4 shows that most errors are introduced by the Pos and Case classifiers, and the Case classifier is responsible for roughly half of the errors. The misclassification matrix for the Case category is presented in Fig. 2. For the Case classifier, more than half of the errors come from misclassifying the accusative case as nominative and vice versa. The errors of the Pos classifier are more diverse; the most common one is misclassifying particles as conjunctions (14% of the errors).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and future work</title>
<p>Part-of-speech tagging is a well-developed task, but there is still room for improvement. As far as we know, we are the first to propose using NB-SVM with a character n-grams representation for POS-tagging; we showed that it outperforms other linear classifiers in the first shared task of the MorphoRuEval-17 challenge and, moreover, was among the 5 best models. For this task we also showed that it can be advantageous to use a single classifier to jointly determine several grammatical categories instead of a separate classifier for each category. Error analysis showed that the most promising direction is to improve the Case classifier, for example by increasing its prediction accuracy for the nominative and accusative cases.</p>
<p>For future work it will be interesting to try Recurrent Neural Networks, which showed state-of-the-art results for POS-tagging of English and several other languages. Using large unlabeled corpora for unsupervised pretraining is also a very promising technique, because it can significantly improve classification of rare and out-of-vocabulary words.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Brants</surname>
            <given-names>T.</given-names>
          </string-name>
          , (
          <year>2000</year>
          ),
<article-title>TnT - a statistical part-of-speech tagger</article-title>
          .
          <source>In Proceedings of the 6th Applied NLP Conference, ANLP-2000, April 29 - May 3</source>
          ,
          <year>2000</year>
          , Seattle, WA.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
<surname>Lafferty</surname>
<given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            <given-names>F.</given-names>
          </string-name>
          , (
          <year>2001</year>
          ),
<article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>In Proceedings of the Eighteenth International Conference on Machine Learning</source>
, Williamstown, USA, pp.
<fpage>282</fpage>
-
<lpage>289</lpage>
.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Mesnil</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , (
          <year>2015</year>
          ),
          <article-title>Ensemble of Generative and Discriminative Techniques for Sentiment Analysis of Movie Reviews</article-title>
          . Submitted to the
          <source>workshop track of ICLR</source>
          <year>2015</year>
          , available at: https://arxiv.org/abs/1412.5335.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Okazaki N.</surname>
          </string-name>
          , (
          <year>2007</year>
          ),
<article-title>CRFsuite: a fast implementation of Conditional Random Fields (CRFs)</article-title>
, available at: http://www.chokkan.org/software/crfsuite/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Plank</surname>
<given-names>B.</given-names>
</string-name>
,
<string-name>
<surname>Søgaard</surname>
<given-names>A.</given-names>
</string-name>
,
          <string-name>
            <surname>Goldberg</surname>
            <given-names>Y.</given-names>
          </string-name>
          , (
          <year>2016</year>
          ),
          <article-title>Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss</article-title>
          .
          <source>In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          , Berlin, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
<string-name><surname>Pedregosa</surname> <given-names>F.</given-names></string-name>, <string-name><surname>Varoquaux</surname> <given-names>G.</given-names></string-name>, <string-name><surname>Gramfort</surname> <given-names>A.</given-names></string-name>, <string-name><surname>Michel</surname> <given-names>V.</given-names></string-name>, <string-name><surname>Thirion</surname> <given-names>B.</given-names></string-name>, <string-name><surname>Grisel</surname> <given-names>O.</given-names></string-name>, <string-name><surname>Blondel</surname> <given-names>M.</given-names></string-name>, <string-name><surname>Prettenhofer</surname> <given-names>P.</given-names></string-name>, <string-name><surname>Weiss</surname> <given-names>R.</given-names></string-name>, <string-name><surname>Dubourg</surname> <given-names>V.</given-names></string-name>, <string-name><surname>Vanderplas</surname> <given-names>J.</given-names></string-name>, <string-name><surname>Passos</surname> <given-names>A.</given-names></string-name>, <string-name><surname>Cournapeau</surname> <given-names>D.</given-names></string-name>, (<year>2011</year>), <article-title>"Scikit-learn: Machine Learning in Python"</article-title>. <source>Journal of Machine Learning Research</source>, <volume>12</volume>: pp. <fpage>2825</fpage>-<lpage>2830</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
<string-name><surname>dos Santos</surname> <given-names>C.</given-names></string-name>, <string-name><surname>Zadrozny</surname> <given-names>B.</given-names></string-name>
          , (
          <year>2014</year>
          ),
<article-title>Learning character-level representations for part-of-speech tagging</article-title>
          .
<source>In Proceedings of the 31st International Conference on Machine Learning (ICML)</source>
          , Beijing, China,
          <year>2014</year>
          . JMLR: WCP vol.
          <volume>32</volume>
          ., pp.
          <fpage>1818</fpage>
          -
          <lpage>1826</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><surname>Sorokin</surname> <given-names>A.</given-names></string-name>, <string-name><surname>Shavrina</surname> <given-names>T.</given-names></string-name>, <string-name><surname>Lyashevskaya</surname> <given-names>O.</given-names></string-name>, <string-name><surname>Bocharov</surname> <given-names>V.</given-names></string-name>, <string-name><surname>Alexeeva</surname> <given-names>S.</given-names></string-name>, <string-name><surname>Droganova</surname> <given-names>K.</given-names></string-name>, <string-name><surname>Fenogenova</surname> <given-names>A.</given-names></string-name>, (<year>2017</year>), <article-title>MorphoRuEval-2017: an evaluation track for the automatic morphological analysis methods for Russian</article-title>. <source>In Computational Linguistics and Intellectual Technologies. Proceedings of the International Workshop Dialogue'2017</source>, Moscow.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name><surname>Wang</surname> <given-names>S.</given-names></string-name>, <string-name><surname>Manning</surname> <given-names>C.</given-names></string-name>, (<year>2012</year>), <article-title>Baselines and Bigrams: Simple, Good Sentiment and Topic Classification</article-title>. <source>Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics</source>, Jeju, Republic of Korea, pp. <fpage>90</fpage>-<lpage>94</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>