Russian Named Entities Recognition and Classification Using Distributed Word and Phrase Representations

Roman Ivanitskiy (ITMO University, Saint-Petersburg, Russia), litemn@yandex.ru
Alexander Shipilo (Saint-Petersburg State University; ITMO University, Saint-Petersburg, Russia), alexandershipilo@gmail.com
Liubov Kovriguina (ITMO University, Saint-Petersburg, Russia), lyukovriguina@corp.ifmo.ru

Abstract

The paper presents results on Russian named entity classification and equivalent named entity retrieval using word and phrase representations. It is shown that the context vector of a word or an expression is an efficient feature for predicting the type of a named entity. Distributed word representations are now claimed (and on a reasonable basis) to be one of the most promising distributional semantics models. In the described experiment on retrieving similar named entities, the results go further than retrieving named entities of the same type or named entities-individuals of the same class: it is shown that equivalent variants of a named entity can be extracted. This result contributes to the task of unsupervised clustering of entities and semantic relations and can be used for paraphrase search and automatic ontology population. The models were trained with word2vec on the Russian segment of parallel corpora used for statistical machine translation. Vector representations were constructed and evaluated for words, lexemes and noun phrases.

1 Introduction

The model of distributed word and phrase representations introduced by Mikolov et al. in 2013 (Mikolov et al., 2013) has proved its efficiency on a variety of languages and tasks in natural language processing and has received a number of extensions since its appearance. It provides a faster and more accurate implementation of models relying on the basic idea of distributional semantics, known as "similar words occur in similar contexts". Mikolov et al. have shown that "word representations computed using neural networks are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns, and ... many of these patterns can be represented as linear translations" (Mikolov et al., 2013). This paper presents the results of applying word2vec (a group of models, and accompanying software, for unsupervised learning of word representations) to a traditional NLP task, named entity recognition (NER), for the Russian language. Results concerning NER classification can contribute to the pool of evaluation data and extend existing distributional semantic models for Russian, e.g. RusVectores (cf. http://ling.go.mail.ru/dsm/en/).

Named entity recognition and classification can be successfully done using a large number of techniques and resources, especially Semantic Web technologies and knowledge bases like DBpedia (cf. http://wiki.dbpedia.org/), which provides semantic search over billions of entities. DBpedia Spotlight (cf. http://spotlight.dbpedia.org/), a tool for automatically annotating mentions of DBpedia resources in text, can sidestep the problem of NER annotation for newswire corpora, nonfiction corpora, datasets of medical records, etc. However, some genres of human discourse produce texts that lack such resources and demand considerable annotation effort: spoken language gives plenty of examples of occasional abbreviations and unpredictable distortions of personal names, toponyms and organization names. Moreover, paraphrase search has recently become an active line of research. This determined our interest in analyzing the responses of a trained word2vec model given a named entity as a stimulus. Before applying word2vec to spoken corpora, we decided to test its ability to cluster named entities with the same label and to extract semantic equivalents for a given named entity on the Russian segment of parallel corpora used for machine translation.
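As an illustration of the "linear translation" property quoted above, the following minimal sketch probes a trained model with the classic analogy query from Mikolov et al. (2013). It assumes the gensim library and a word2vec-format model file, neither of which is part of this paper; the model path and the example words are placeholders.

```python
from gensim.models import KeyedVectors

# Load pre-trained vectors (the path is a placeholder; any file in
# word2vec binary format works here).
vectors = KeyedVectors.load_word2vec_format("model.bin", binary=True)

# The classic "linear translation" probe from Mikolov et al. (2013):
# vec("king") - vec("man") + vec("woman") should land near vec("queen").
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=5))
```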
Two experiments are described in the paper. The first one trains an SVM classifier on the FactRuEval training dataset (cf. http://github.com/dialogue-evaluation/factRuEval-2016); the second analyses lists of entities with the highest value of cosine similarity to a named entity given as a stimulus. Both experiments are run on four trained models: models 1 and 2 were trained on a 1-billion-token corpus (on word forms and lexemes respectively), and models 3 and 4 were trained on a 100-million-token corpus (a subset of the larger one) which has been annotated with noun phrases in order to extend word representations to noun phrase representations.

2 Related Work

There exists a considerable number of studies on NER for English texts evaluating various types of algorithms, but Russian NER has mostly been done using rule-based algorithms and pattern matching, whereas recent studies focus on word embeddings as a feature for training NER classifiers (Turian et al., 2010), on news corpora (Siencnik, 2015), (Seok et al., 2016), on microblog posts (Godin et al., 2014), (Kisa and Karagoz, 2015), on the CoNLL 2003 Shared Task corpus and on Wikipedia articles.

Segura-Bedmar et al. (Segura-Bedmar et al., 2015) describe a machine learning approach that uses word embedding features to recognize drug names in biomedical texts. They trained the word2vec tool on two different corpora, Wikipedia and MedLine, in order to study the effectiveness of using word embeddings as features to improve the performance of a NER system. To evaluate the approach and compare it with previous work, they ran a series of experiments on the dataset of SemEval-2013 Task 9.1 (Drug Name Recognition). Demir and Ozgur (Demir and Ozgur, 2014) developed a fast unsupervised method for learning continuous vector representations of words and used these representations, along with language-independent features, to develop a NER system. They evaluated the system on the highly inflectional Turkish and Czech languages. The Turkish datasets contained 63.72M sentences, corresponding to a total of 1.02B words and 1.36M hapax legomena. For Czech, publicly available data crawled from Czech news sites and provided by the ACL machine translation workshop were used; this dataset contained 36.42M sentences, corresponding to 635.99M words and 906K hapax legomena.

A number of papers describe experiments that go beyond word representations and "construct phrase embeddings by learning how to compose word embeddings using features that capture phrase structure and context" (Yu and Dredze, 2015), (Lopyrev, 2014). However, the notion of "phrase" in these works is quite vague and varies considerably. Yin and Schütze stress that "generalized phrases ... include conventional linguistic phrases as well as skip-bigrams. ... Socher et al. use the term 'word sequence'. Mikolov et al. use the term 'phrase' for word sequences that are mostly frequent continuous collocations" (Yin and Schütze, 2014). For the purposes of the described experiment, accurate noun phrase extraction is crucial, because the items of a noun phrase can be rare words while the whole phrase occurs in frequent contexts (on processing rare words in distributed word representation models see (Guthrie et al., 2006)).

3 Data Preparation

3.1 Datasets

Four datasets were built to train distributed word representations on the basis of the FactRuEval training dataset and the Russian parts of parallel corpora used to train statistical machine translation systems (cf. http://www.statmt.org/). The list of all used corpora is given below:

• Russian subcorpus of the Multilingual UN Parallel Text 2000–2009,

• Europarl,

• News,

• FactRuEval,

• Russian subcorpus of the Yandex parallel corpus,

• Russian subcorpus of the Czech-English-Russian parallel corpus.

The total size of these corpora is 1 billion tokens. The datasets will from now on be referred to as Dataset 1, Dataset 2, Dataset 3 and Dataset 4; they were used to train word2vec models with the same indices. Basic preprocessing included the removal of xml/html tagging, timestamps and URLs.
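The exact cleaning scripts are not described in the paper; the sketch below merely illustrates what this preprocessing step amounts to. The regular expressions for tags, URLs and ISO-style timestamps are our assumptions, not the authors' code.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                   # xml/html tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")     # URLs
TS_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?\b")  # timestamps

def clean_line(line: str) -> str:
    """Strip xml/html tagging, URLs and timestamps from one corpus line."""
    for pattern in (TAG_RE, URL_RE, TS_RE):
        line = pattern.sub(" ", line)
    return " ".join(line.split())  # normalize whitespace

print(clean_line("<p>See http://example.com at 2009-06-01 12:00</p>"))
# -> 'See at'
```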
Dataset 1. This corpus is built of the wordforms of the 1-billion-token corpus and has no linguistic preprocessing except tokenization. The training entity is the word form.

Dataset 2. This is the lemmatized 1-billion-token corpus. Tagging was performed using the Mystem morphological analyzer (cf. https://tech.yandex.ru/mystem/), which supports homonymy resolution. The training entity is the lexeme.

Dataset 3. This is a 100-million-token subcorpus of the above corpus. The training entities are wordforms and noun phrases.

Dataset 4. This is the lemmatized 100-million-token subcorpus of the above corpus. The training entities are lexemes and noun phrases (also represented by lexemes).

3.2 Noun Phrase Extraction for Corpora 3 and 4

For the given task, a noun phrase may include more than one named entity. Therefore, to provide equal context probability, smaller noun phrases were extracted from the complex ones (e.g. the string "Government of Krasnoyarsk Krai" (label: organization) is represented by the whole noun phrase and by its smaller part, the noun phrase "Krasnoyarsk Krai" (label: location)). For these cases, sentences are duplicated in the corpus for each embedded noun phrase. Noun phrases are extracted using the following procedure (a minimal sketch of the final steps follows the list):

• input sentences are tokenized, tagged and parsed using the SemSin syntactic parser, which produces a labelled syntactic tree for the input sentence (Kanevsky and Boyarsky, 2012);

• the NP extraction algorithm finds all word sequences depending on every noun within the sentence and writes these sequences out as candidate noun phrases;

• candidate noun phrases that contain no symbols in uppercase are filtered out.
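The SemSin output format is not reproduced here, so the following sketch covers only the last steps under simplifying assumptions: candidate phrases arrive as token lists, phrases without an uppercase character are discarded, and each surviving phrase is rewritten as a single underscore-joined token so that word2vec treats the noun phrase as one training unit. The underscore convention follows the word2phrase tool and is our assumption, not a detail stated in the paper.

```python
from typing import List

def keep_candidate(tokens: List[str]) -> bool:
    """Filter: keep only candidate NPs containing at least one uppercase symbol."""
    return any(ch.isupper() for tok in tokens for ch in tok)

def inject_phrase(sentence: List[str], phrase: List[str]) -> List[str]:
    """Return a copy of the sentence with the phrase merged into one token.

    Sentences are duplicated per embedded phrase, so each call produces
    one training variant of the sentence.
    """
    n = len(phrase)
    for i in range(len(sentence) - n + 1):
        if sentence[i:i + n] == phrase:
            return sentence[:i] + ["_".join(phrase)] + sentence[i + n:]
    return sentence

sent = ["Government", "of", "Krasnoyarsk", "Krai", "resigned"]
for np in (["Government", "of", "Krasnoyarsk", "Krai"], ["Krasnoyarsk", "Krai"]):
    if keep_candidate(np):
        print(" ".join(inject_phrase(sent, np)))
# Government_of_Krasnoyarsk_Krai resigned
# Government of Krasnoyarsk_Krai resigned
```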
4 Evaluation Procedure

System performance was evaluated using the above-mentioned manually tagged FactRuEval test dataset. It has three basic types of named entities: names of persons, organizations and locations. For the first experiment, a string containing a named entity was sent to the classifier, which produced its label. For datasets 1 and 2 the evaluation dataset was cut down to named entities represented by single word forms/lexemes; datasets 3 and 4 were evaluated on the whole test set (see the results in Tables 2–5 of Section 6). For the second experiment, named entities from the FactRuEval training dataset were used as stimuli. For datasets 1 and 2 the stimuli list included only unigrams, and for datasets 3 and 4 the list was built of 20% unigrams and 80% noun phrases of length 2 to 5. Each stimulus was fed to the trained word2vec model, which generated a response list of the 10 NE candidates with the highest cosine similarity. Candidate NEs were manually tagged as true if the candidate was a named entity of the same class as the stimulus, and false otherwise. Evaluation results are presented in Table 6, Section 6.

5 Experiment Setup

The overall architecture of the system can be seen in Fig. 1. The software used includes the open source word2vec toolkit (cf. https://code.google.com/archive/p/word2vec), Java libraries for word2vec (cf. http://deeplearning4j.org/), Weka (cf. http://www.cs.waikato.ac.nz/ml/weka/) and the NLP software mentioned in Section 3. The workflow of both experiments comprises the following steps:

1. Data collection and cleansing;

2. Linguistic processing of the data (tokenization, sentence segmentation, tagging, parsing);

3. NP extraction;

4. Model training and evaluation on wordforms (trained model 1);

5. Model training and evaluation on lexemes (trained model 2);

6. Model training and evaluation on noun phrases (trained models 3 and 4);

7. Building stimuli lists for each model;

8. Experiment 1 on NE classification;

9. Experiment 2 on NE prediction and classification;

10. Evaluation.

[Figure 1: Workflow of the named entity recognition system using distributed word and phrase representations.]

Experiment 1 detailed plan. An SVM classifier was trained on the FactRuEval training set. NE word2vec vectors were used as feature vectors (the dimension was set to 200). The FactRuEval test set was used to test the classifier, which is given an NE unigram or an NE noun phrase and returns its label.

Experiment 2 detailed plan. Unigrams and noun phrases from the stimuli lists were sent to the trained word2vec models. Each model returned a list of the 10 best candidates for each stimulus, which included both words and phrases (for models 3 and 4). The percentage of named entities having the same label as the stimulus was counted.
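A condensed sketch of steps 4–9 on toy data is given below. It assumes gensim (version 4, hence `vector_size`) for word2vec and scikit-learn for the SVM; apart from the 200-dimensional vectors named in the detailed plan, all training parameters and the toy corpus are our assumptions, not the paper's setup.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import SVC

# Toy stand-in corpus: lists of tokens, with noun phrases already merged
# into single underscore-joined tokens (as for models 3 and 4).
sentences = [
    ["Moscow", "is", "a", "city"],
    ["Berlin", "is", "a", "city"],
    ["Gazprom", "is", "a", "company"],
    ["Lukoil", "is", "a", "company"],
] * 50  # repeat so that min_count is satisfied

# Steps 4-6: train a 200-dimensional model (other parameters are assumptions).
model = Word2Vec(sentences, vector_size=200, window=5, min_count=5,
                 workers=1, seed=1)

# Experiment 1: an SVM over NE vectors (here, a toy training set).
train = [("Moscow", "LOC"), ("Berlin", "LOC"), ("Gazprom", "ORG")]
X = np.array([model.wv[w] for w, _ in train])
y = [label for _, label in train]
clf = SVC().fit(X, y)
print(clf.predict([model.wv["Lukoil"]]))  # hopefully 'ORG'

# Experiment 2: 10-best neighbours of an NE stimulus by cosine similarity.
for candidate, score in model.wv.most_similar("Moscow", topn=10):
    print(f"{score:.3f}\t{candidate}")
```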
6 Results and Discussion

Experiment 1: NE Label Prediction Evaluated on FactRuEval Training and Test Datasets. Figures 2–5 show the output of the SVM classifier after dimensionality reduction with the t-SNE algorithm (cf. https://lvdmaaten.github.io/tsne/) for all four trained models. The distribution of NE labels conforms with the well-known fact that in many cases it is difficult or impossible to distinguish organizations and locations (in Figures 2–5, 0 corresponds to organizations, 1 to locations and 2 to names of persons). Classification quality was evaluated with the f-score measure; results are given in Tables 2–5. The system shows competitive quality in comparison to other machine learning or rule-based algorithms developed for the Russian language according to the report provided by the FactRuEval committee in 2016 (Starostin et al., 2016), see Table 1. Table 1 gives the minimum and maximum values for precision, recall and f-score; average values over the performance of the 13 NER systems that took part in the competition are given in round brackets. If we compare state-of-the-art performance with the performance of the described system (for model 4), based on the distributed word representations approach, we can see that the system shows average results for locations (0.86 f-score) and persons (0.89 f-score) and outperforms state-of-the-art systems in retrieving organizations (0.79 vs 0.68 f-score). NE unigrams are classified with very high f-scores (0.99, 0.96 and 0.97 for persons, locations and organizations respectively, according to model 2).

[Figures 2–5: SVM class distribution for Models 1, 2, 3 and 4, respectively.]

It can be seen from Figure 3 that the points corresponding to the three NE types interfere less, showing better classification results. This is a common feature of models 2 and 4, which were both trained on datasets containing lemmas, whereas models 1 and 3 (see Figures 2 and 4) were trained on datasets with wordforms, and there the areas corresponding to each NE type are very vague. Person names are classified with the highest f-score in all four models, which is quite predictable, because distinguishing between locations and organizations is sometimes a non-trivial task (i.e. it cannot always be made clear from the context whether a social institute (organization) or the building it occupies (location) is mentioned). Both for NE single words and NE phrases, the results show the importance of lemmatization before computing word embeddings for inflectional languages with rich morphology, like Russian, even when a large corpus is used.

Table 1: State-of-the-art performance of NER systems for the Russian language (minimum–maximum over 13 systems; averages in round brackets).

NE label      Precision          Recall             F-score
Location      0.89–0.96 (0.93)   0.68–0.86 (0.80)   0.76–0.91 (0.86)
Person        0.90–0.96 (0.93)   0.73–0.92 (0.86)   0.81–0.93 (0.89)
Organization  0.74–0.87 (0.80)   0.26–0.76 (0.61)   0.39–0.79 (0.68)

Table 2: Classification Accuracy for Model 1.

NE label      Precision  Recall  F-score
Location      0.93       0.79    0.85
Person        0.83       0.74    0.89
Organization  0.81       0.94    0.77

Table 3: Classification Accuracy for Model 2.

NE label      Precision  Recall  F-score
Location      0.96       0.96    0.96
Person        0.99       0.99    0.99
Organization  0.96       0.98    0.97

Table 4: Classification Accuracy for Model 3.

NE label      Precision  Recall  F-score
Location      0.87       0.77    0.81
Person        0.80       0.88    0.84
Organization  0.75       0.72    0.73

Table 5: Classification Accuracy for Model 4.

NE label      Precision  Recall  F-score
Location      0.88       0.84    0.86
Person        0.90       0.87    0.89
Organization  0.86       0.83    0.79

Experiment 2: NE Search and Classification Using Word2vec Semantic Similarity Vectors. Given a word or a phrase, word2vec is capable of retrieving linguistic units that stand in some semantic relation to the given one: synonyms, items of the same paradigmatic class, associations. But what can be found in the semantic similarity space of a named entity? In this experiment it is assumed that among the words and phrases whose vectors have high cosine similarity with the vector of a named entity, equivalent names of that named entity can be found. This turned out to be true for 48% of organizations, 50% of locations and 57% of person names (according to model 4). In 30% of cases, more than 3 equivalent names are found among the first 10 responses to the NE stimulus. Quality was evaluated as the percentage of true positives, given in Table 6.

Table 6: Unlabelled NE Prediction Accuracy on Distributed Representations (% of true positives).

Model     Location  Person  Organization
Model 1   81.29     54.03   64.61
Model 2   62.17     52.13   46.90
Model 3   55.43     57.84   46.26
Model 4   67.63     68.21   49.45

The overall quality is not high; still, it is possible to find and predict the class of unlabelled named entities whose vectors have high cosine similarity with the vector of a labelled NE. Trained model 1 produces high scores due to evaluation limitations: in models 1 and 2 only unigrams are considered. Subsequent comparison of trained models 2–4 confirms that quality improves when noun phrases are predicted. Below some examples are provided; only English translations are given. The NE stimulus is the first item in each list, given in italics; the remaining items are responses. Equivalents (which can be paraphrases or alternative names) are given in bold.

• The Prosecutor General: ATTY GEN, ATTY GEN of Russia, RF ATTY GEN, Deputy Prosecutor General, RF Prosecutor General, RF Prosecutor, General Prosecutor Office, Prosecutor General of Russia, Prosecutor General of Ukraine, Prosecutor General of Moscow

• Latin America: Latin, South America, Countries of Latin America, Latin American countries, South-East Asia, Countries of South America, China, Country, Eastern Europe

In most cases the list of responses contains individuals of the same class as the stimulus: e.g. given the name of a region of Russia, the model will return a list of other Russian regions. Among the NE candidates for city stimuli, wrongly lemmatized city names and toponym misspellings were found, which can also be used to detect lemmatization or spelling mistakes.
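For completeness, the Table 6 metric can be approximated in a few lines. In the paper the true/false tagging of candidates was done manually, so the `label_of` lookup below is a hypothetical stand-in for that manual step, and a gensim `KeyedVectors` model is assumed; this is a sketch, not the authors' evaluation code.

```python
from gensim.models import KeyedVectors

def same_label_share(wv: KeyedVectors, stimuli, label_of, topn: int = 10) -> float:
    """Share of top-`topn` neighbours carrying the same NE label as the stimulus.

    `stimuli` is a list of NE strings (phrases underscore-joined); `label_of`
    maps a string to its NE label, or None if it is not a known named entity.
    """
    hits = total = 0
    for stimulus in stimuli:
        if stimulus not in wv:  # skip out-of-vocabulary stimuli
            continue
        for candidate, _score in wv.most_similar(stimulus, topn=topn):
            total += 1
            if label_of(candidate) == label_of(stimulus):
                hits += 1
    return hits / total if total else 0.0
```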
7 Future Work

Future work implies the development of a stable and comprehensive model of distributed noun phrase representations that will extend existing resources for the Russian language. The admissible results on NE prediction using word2vec response lists allow us to continue with experiments on NE recognition from noisy texts and spoken language. The ability of distributed word representations to capture paraphrases and lexical variants of named entities can be used in algorithms for paraphrase search and for clustering similar entities and events.

8 Acknowledgements

This work was partially financially supported by the Government of the Russian Federation, Grant 074-U01.

References

Hakan Demir and Arzucan Ozgur. 2014. Improving Named Entity Recognition for Morphologically Rich Languages Using Word Embeddings. In 13th International Conference on Machine Learning and Applications, ICMLA 2014, Detroit, MI, USA, December 3–6, 2014, pages 117–122.

Frederic Godin, Baptist Vandersmissen, Wesley De Neve, and Rik Van de Walle. 2014. ACL W-NUT NER Shared Task: Named Entity Recognition for Twitter Microposts using Distributed Word Representations.

David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. 2006. A Closer Look at Skip-gram Modelling.

Evgeniy Kanevsky and Kirill Boyarsky. 2012. The Semantic-and-Syntactic Parser SEMSIN. In International Conference on Computational Linguistics Dialog-2012, Bekasovo, Russia.

Kezban Dilek Kisa and Pinar Karagoz. 2015. Named Entity Recognition from Scratch on Social Media. In Proceedings of the 6th International Workshop on Mining Ubiquitous and Social Environments (MUSE 2015), co-located with the 26th European Conference on Machine Learning / 19th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2015), Porto, Portugal, September 7, 2015, pages 2–17.

Konstantin Lopyrev. 2014. Learning Distributed Representations of Phrases.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, December 5–8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.

Isabel Segura-Bedmar, Víctor Suárez-Paniagua, and Paloma Martínez. 2015. Exploring Word Embedding for Drug Name Recognition. In 13th International Conference on Machine Learning and Applications, ICMLA 2014, Detroit, MI, USA, December 3–6, 2014, pages 117–122.

Miran Seok, Hye-Jeong Song, Chan-Young Park, Jong-Dae Kim, and Yu-seop Kim. 2016. Named Entity Recognition using Word Embedding as a Feature. International Journal of Software Engineering and Its Applications, 10(2).

Scharolta Katharina Siencnik. 2015. Adapting Word2vec to Named Entity Recognition. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 239–243.

A. Starostin, V. Bocharov, S. Alexeeva, and A. Bodrova. 2016. FactRuEval 2016: Evaluation of Named Entity Recognition and Fact Extraction Systems for Russian. In Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference "Dialogue 2016", Moscow, June 1–4, 2016.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word Representations: A Simple and General Method for Semi-Supervised Learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, Uppsala, Sweden, July. Association for Computational Linguistics.

Wenpeng Yin and Hinrich Schütze. 2014. An Exploration of Embeddings for Generalized Phrases. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22–27, 2014, Baltimore, MD, USA, Student Research Workshop, pages 41–47.

Mo Yu and Mark Dredze. 2015. Learning Composition Models for Phrase Embeddings. TACL, 3:227–242.