Russian Named Entities Recognition and Classification Using Distributed Word and Phrase Representations

Roman Ivanitskiy (ITMO University, Saint-Petersburg, Russia), litemn@yandex.ru
Alexander Shipilo (Saint-Petersburg State University; ITMO University, Saint-Petersburg, Russia), alexandershipilo@gmail.com
Liubov Kovriguina (ITMO University, Saint-Petersburg, Russia), lyukovriguina@corp.ifmo.ru

Abstract

The paper presents results on Russian named entity classification and equivalent named entity retrieval using word and phrase representations. It is shown that the context vector of a word or an expression is an efficient feature for predicting the type of a named entity. Distributed word representations are now claimed (and on a reasonable basis) to be one of the most promising distributional semantics models. In the described experiment on retrieving similar named entities, the results go further than retrieving named entities of the same type or named entities-individuals of the same class: it is shown that equivalent variants of a named entity can be extracted. This result contributes to the task of unsupervised clustering of entities and semantic relations and can be used for paraphrase search and automatic ontology population. The models were trained with word2vec on the Russian segment of parallel corpora used for statistical machine translation. Vector representations were constructed and evaluated for words, lexemes and noun phrases.

1 Introduction

The model of distributed word and phrase representations introduced by Mikolov et al. in 2013 (Mikolov et al., 2013) has proved its efficiency on a variety of languages and tasks in natural language processing and has received a number of extensions since its appearance. It provides a faster and more accurate implementation of models relying on the basic idea of distributional semantics, known as "similar words occur in similar contexts". Mikolov et al. have shown that "word representations computed using neural networks are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns, and ... many of these patterns can be represented as linear translations" (Mikolov et al., 2013). This paper presents the results of applying word2vec (a group of models, and accompanying software, for unsupervised learning of word representations) to a traditional NLP task, named entity recognition (NER), for the Russian language. Results concerning NER classification can contribute to the pool of evaluation data and extend existing distributional semantic models for Russian, e.g. RusVectores (cf. http://ling.go.mail.ru/dsm/en/).

Named entity recognition and classification can be successfully done using a large number of techniques and resources, especially Semantic Web technologies and knowledge bases like DBpedia (cf. http://wiki.dbpedia.org/), which provides semantic search over billions of entities. DBpedia Spotlight (cf. http://spotlight.dbpedia.org/), a tool for automatically annotating mentions of DBpedia resources in text, can sidestep the problem of NER annotation for newswire corpora, nonfiction corpora, datasets of medical records, etc. However, some genres of human discourse produce texts that lack such resources and demand considerable annotation effort: spoken language gives plenty of examples of occasional abbreviations and unpredictable distortions of personal names, toponyms and organization names. Moreover, paraphrase search has recently become an active line of research. This determined our interest in analyzing the responses of a trained word2vec model given a named entity as a stimulus. Before applying word2vec to spoken corpora, we decided to test its ability to cluster named entities with the same label and to extract semantic equivalents for a given named entity on the Russian segment of parallel corpora used for machine translation.
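As an illustration of the "linear translation" property quoted above, the following minimal sketch probes a trained model with the classic analogy query from Mikolov et al. (2013). It assumes the gensim library and a word2vec-format model file, neither of which is part of this paper; the model path and the example words are placeholders.

```python
from gensim.models import KeyedVectors

# Load pre-trained vectors (the path is a placeholder; any file in
# word2vec binary format works here).
vectors = KeyedVectors.load_word2vec_format("model.bin", binary=True)

# The classic "linear translation" probe from Mikolov et al. (2013):
# vec("king") - vec("man") + vec("woman") should land near vec("queen").
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=5))
```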
Two experiments are described in the paper. The first one trains an SVM classifier on the FactRuEval training dataset (cf. http://github.com/dialogue-evaluation/factRuEval-2016); the second analyses lists of entities with the highest value of cosine similarity to a named entity given as a stimulus. Both experiments are run on four trained models: models 1 and 2 were trained on a 1-billion-token corpus (on word forms and lexemes respectively), and models 3 and 4 were trained on a 100-million-token corpus (a subset of the larger one) which has been annotated with noun phrases in order to extend word representations to noun phrase representations.

2 Related Work

There exists a considerable number of studies on NER for English texts evaluating various types of algorithms, but Russian NER has mostly been done using rule-based algorithms and pattern matching, whereas recent studies focus on word embeddings as a feature for training NER classifiers (Turian et al., 2010), on news corpora (Siencnik, 2015), (Seok et al., 2016), on microblog posts (Godin et al., 2014), (Kisa and Karagoz, 2015), on the CoNLL 2003 Shared Task corpus and on Wikipedia articles.

Segura-Bedmar et al. (Segura-Bedmar et al., 2015) describe a machine learning approach that uses word embedding features to recognize drug names in biomedical texts. They trained the word2vec tool on two different corpora, Wikipedia and MedLine, in order to study the effectiveness of using word embeddings as features to improve the performance of a NER system. To evaluate the approach and compare it with previous work, they ran a series of experiments on the dataset of SemEval-2013 Task 9.1 (Drug Name Recognition). Demir and Ozgur (Demir and Ozgur, 2014) developed a fast unsupervised method for learning continuous vector representations of words and used these representations, along with language-independent features, to develop a NER system. They evaluated the system on the highly inflectional Turkish and Czech languages. The Turkish datasets contained 63.72M sentences, corresponding to a total of 1.02B words and 1.36M hapax legomena. For Czech, publicly available data crawled from Czech news sites and provided by the ACL machine translation workshop were used; this dataset contained 36.42M sentences, corresponding to 635.99M words and 906K hapax legomena.

A number of papers describe experiments that go beyond word representations and "construct phrase embeddings by learning how to compose word embeddings using features that capture phrase structure and context" (Yu and Dredze, 2015), (Lopyrev, 2014). However, the notion of "phrase" in these works is quite vague and varies considerably. Yin and Schütze stress that "generalized phrases ... include conventional linguistic phrases as well as skip-bigrams. ... Socher et al. use the term 'word sequence'. Mikolov et al. use the term 'phrase' for word sequences that are mostly frequent continuous collocations" (Yin and Schütze, 2014). For the purposes of the described experiment, accurate noun phrase extraction is crucial, because the items of a noun phrase can be rare words while the whole phrase occurs in frequent contexts (on processing rare words in distributed word representation models see (Guthrie et al., 2006)).

3 Data Preparation

3.1 Datasets

Four datasets were built to train distributed word representations on the basis of the FactRuEval training dataset and the Russian parts of parallel corpora used to train statistical machine translation systems (cf. http://www.statmt.org/). The list of all used corpora is given below:

• Russian subcorpus of the Multilingual UN Parallel Text 2000–2009,

• Europarl,

• News,

• FactRuEval,

• Russian subcorpus of the Yandex parallel corpus,

• Russian subcorpus of the Czech-English-Russian parallel corpus.

The total size of these corpora is 1 billion tokens. The datasets will from now on be referred to as Dataset 1, Dataset 2, Dataset 3 and Dataset 4; they were used to train word2vec models with the same indices. Basic preprocessing included the removal of xml/html tagging, timestamps and URLs.
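The exact cleaning scripts are not described in the paper; the sketch below merely illustrates what this preprocessing step amounts to. The regular expressions for tags, URLs and ISO-style timestamps are our assumptions, not the authors' code.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                   # xml/html tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")     # URLs
TS_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?\b")  # timestamps

def clean_line(line: str) -> str:
    """Strip xml/html tagging, URLs and timestamps from one corpus line."""
    for pattern in (TAG_RE, URL_RE, TS_RE):
        line = pattern.sub(" ", line)
    return " ".join(line.split())  # normalize whitespace

print(clean_line("<p>See http://example.com at 2009-06-01 12:00</p>"))
# -> 'See at'
```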
Dataset 1. This corpus is built of the wordforms of the 1-billion-token corpus and has no linguistic preprocessing except tokenization. The training entity is the word form.

Dataset 2. This is the lemmatized 1-billion-token corpus. Tagging was performed using the Mystem morphological analyzer (cf. https://tech.yandex.ru/mystem/), which supports homonymy resolution. The training entity is the lexeme.

Dataset 3. This is a 100-million-token subcorpus of the above corpus. The training entities are wordforms and noun phrases.

Dataset 4. This is the lemmatized 100-million-token subcorpus of the above corpus. The training entities are lexemes and noun phrases (also represented by lexemes).

3.2 Noun Phrase Extraction for Corpora 3 and 4

For the given task, a noun phrase may include more than one named entity. Therefore, to provide equal context probability, smaller noun phrases were extracted from the complex ones (e.g. the string "Government of Krasnoyarsk Krai" (label: organization) is represented by the whole noun phrase and by its smaller part, the noun phrase "Krasnoyarsk Krai" (label: location)). For these cases, sentences are duplicated in the corpus for each embedded noun phrase. Noun phrases are extracted using the following procedure (a minimal sketch of the final steps follows the list):

• input sentences are tokenized, tagged and parsed using the SemSin syntactic parser, which produces a labelled syntactic tree for the input sentence (Kanevsky and Boyarsky, 2012);

• the NP extraction algorithm finds all word sequences depending on every noun within the sentence and writes these sequences out as candidate noun phrases;

• candidate noun phrases that contain no symbols in uppercase are filtered out.
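The SemSin output format is not reproduced here, so the following sketch covers only the last steps under simplifying assumptions: candidate phrases arrive as token lists, phrases without an uppercase character are discarded, and each surviving phrase is rewritten as a single underscore-joined token so that word2vec treats the noun phrase as one training unit. The underscore convention follows the word2phrase tool and is our assumption, not a detail stated in the paper.

```python
from typing import List

def keep_candidate(tokens: List[str]) -> bool:
    """Filter: keep only candidate NPs containing at least one uppercase symbol."""
    return any(ch.isupper() for tok in tokens for ch in tok)

def inject_phrase(sentence: List[str], phrase: List[str]) -> List[str]:
    """Return a copy of the sentence with the phrase merged into one token.

    Sentences are duplicated per embedded phrase, so each call produces
    one training variant of the sentence.
    """
    n = len(phrase)
    for i in range(len(sentence) - n + 1):
        if sentence[i:i + n] == phrase:
            return sentence[:i] + ["_".join(phrase)] + sentence[i + n:]
    return sentence

sent = ["Government", "of", "Krasnoyarsk", "Krai", "resigned"]
for np in (["Government", "of", "Krasnoyarsk", "Krai"], ["Krasnoyarsk", "Krai"]):
    if keep_candidate(np):
        print(" ".join(inject_phrase(sent, np)))
# Government_of_Krasnoyarsk_Krai resigned
# Government of Krasnoyarsk_Krai resigned
```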
4 Evaluation Procedure

System performance was evaluated using the above-mentioned manually tagged FactRuEval test dataset. It has three basic types of named entities: names of persons, organizations and locations. For the first experiment, a string containing a named entity was sent to the classifier, which produced its label. For datasets 1 and 2 the evaluation dataset was cut down to named entities represented by single word forms/lexemes; datasets 3 and 4 were evaluated on the whole test set (see the results in Tables 2–5 of Section 6). For the second experiment, named entities from the FactRuEval training dataset were used as stimuli. For datasets 1 and 2 the stimuli list included only unigrams, and for datasets 3 and 4 the list was built of 20% unigrams and 80% noun phrases of length 2 to 5. Each stimulus was fed to the trained word2vec model, which generated a response list of the 10 NE candidates with the highest cosine similarity. Candidate NEs were manually tagged as true if the candidate was a named entity of the same class as the stimulus, and false otherwise. Evaluation results are presented in Table 6, Section 6.

5 Experiment Setup

The overall architecture of the system can be seen in Fig. 1. The software used includes the open source word2vec toolkit (cf. https://code.google.com/archive/p/word2vec), Java libraries for word2vec (cf. http://deeplearning4j.org/), Weka (cf. http://www.cs.waikato.ac.nz/ml/weka/) and the NLP software mentioned in Section 3. The workflow of both experiments comprises the following steps:

1. Data collection and cleansing;

2. Linguistic processing of the data (tokenization, sentence segmentation, tagging, parsing);

3. NP extraction;

4. Model training and evaluation on wordforms (trained model 1);

5. Model training and evaluation on lexemes (trained model 2);

6. Model training and evaluation on noun phrases (trained models 3 and 4);

7. Building stimuli lists for each model;

8. Experiment 1 on NE classification;

9. Experiment 2 on NE prediction and classification;

10. Evaluation.

[Figure 1: Workflow of the named entity recognition system using distributed word and phrase representations.]

Experiment 1 detailed plan. An SVM classifier was trained on the FactRuEval training set. NE word2vec vectors were used as feature vectors (the dimension was set to 200). The FactRuEval test set was used to test the classifier, which is given an NE unigram or an NE noun phrase and returns its label.

Experiment 2 detailed plan. Unigrams and noun phrases from the stimuli lists were sent to the trained word2vec models. Each model returned a list of the 10 best candidates for each stimulus, which included both words and phrases (for models 3 and 4). The percentage of named entities having the same label as the stimulus was counted.
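A condensed sketch of steps 4–9 on toy data is given below. It assumes gensim (version 4, hence `vector_size`) for word2vec and scikit-learn for the SVM; apart from the 200-dimensional vectors named in the detailed plan, all training parameters and the toy corpus are our assumptions, not the paper's setup.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import SVC

# Toy stand-in corpus: lists of tokens, with noun phrases already merged
# into single underscore-joined tokens (as for models 3 and 4).
sentences = [
    ["Moscow", "is", "a", "city"],
    ["Berlin", "is", "a", "city"],
    ["Gazprom", "is", "a", "company"],
    ["Lukoil", "is", "a", "company"],
] * 50  # repeat so that min_count is satisfied

# Steps 4-6: train a 200-dimensional model (other parameters are assumptions).
model = Word2Vec(sentences, vector_size=200, window=5, min_count=5,
                 workers=1, seed=1)

# Experiment 1: an SVM over NE vectors (here, a toy training set).
train = [("Moscow", "LOC"), ("Berlin", "LOC"), ("Gazprom", "ORG")]
X = np.array([model.wv[w] for w, _ in train])
y = [label for _, label in train]
clf = SVC().fit(X, y)
print(clf.predict([model.wv["Lukoil"]]))  # hopefully 'ORG'

# Experiment 2: 10-best neighbours of an NE stimulus by cosine similarity.
for candidate, score in model.wv.most_similar("Moscow", topn=10):
    print(f"{score:.3f}\t{candidate}")
```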
6 Results and Discussion

Experiment 1: NE Label Prediction Evaluated on FactRuEval Training and Test Datasets. Figures 2–5 show the output of the SVM classifier after dimensionality reduction with the t-SNE algorithm (cf. https://lvdmaaten.github.io/tsne/) for all four trained models. The distribution of NE labels conforms with the well-known fact that in many cases it is difficult or impossible to distinguish organizations and locations (in Figures 2–5, 0 corresponds to organizations, 1 to locations and 2 to names of persons). Classification quality was evaluated with the f-score measure; results are given in Tables 2–5. The system shows competitive quality in comparison to other machine learning or rule-based algorithms developed for the Russian language according to the report provided by the FactRuEval committee in 2016 (Starostin et al., 2016), see Table 1. Table 1 gives the minimum and maximum values for precision, recall and f-score; average values over the performance of the 13 NER systems that took part in the competition are given in round brackets. If we compare state-of-the-art performance with the performance of the described system (for model 4), based on the distributed word representations approach, we can see that the system shows average results for locations (0.86 f-score) and persons (0.89 f-score) and outperforms state-of-the-art systems in retrieving organizations (0.79 vs 0.68 f-score). NE unigrams are classified with very high f-scores (0.99, 0.96 and 0.97 for persons, locations and organizations respectively, according to model 2).

[Figures 2–5: SVM class distribution for Models 1, 2, 3 and 4, respectively.]

It can be seen from Figure 3 that the points corresponding to the three NE types interfere less, showing better classification results. This is a common feature of models 2 and 4, which were both trained on datasets containing lemmas, whereas models 1 and 3 (see Figures 2 and 4) were trained on datasets with wordforms, and there the areas corresponding to each NE type are very vague. Person names are classified with the highest f-score in all four models, which is quite predictable, because distinguishing between locations and organizations is sometimes a non-trivial task (i.e. it cannot always be made clear from the context whether a social institute (organization) or the building it occupies (location) is mentioned). Both for NE single words and NE phrases, the results show the importance of lemmatization before computing word embeddings for inflectional languages with rich morphology, like Russian, even when a large corpus is used.

Table 1: State-of-the-art performance of NER systems for the Russian language (minimum–maximum over 13 systems; averages in round brackets).

NE label      Precision          Recall             F-score
Location      0.89–0.96 (0.93)   0.68–0.86 (0.80)   0.76–0.91 (0.86)
Person        0.90–0.96 (0.93)   0.73–0.92 (0.86)   0.81–0.93 (0.89)
Organization  0.74–0.87 (0.80)   0.26–0.76 (0.61)   0.39–0.79 (0.68)

Table 2: Classification Accuracy for Model 1.

NE label      Precision  Recall  F-score
Location      0.93       0.79    0.85
Person        0.83       0.74    0.89
Organization  0.81       0.94    0.77

Table 3: Classification Accuracy for Model 2.

NE label      Precision  Recall  F-score
Location      0.96       0.96    0.96
Person        0.99       0.99    0.99
Organization  0.96       0.98    0.97

Table 4: Classification Accuracy for Model 3.

NE label      Precision  Recall  F-score
Location      0.87       0.77    0.81
Person        0.80       0.88    0.84
Organization  0.75       0.72    0.73

Table 5: Classification Accuracy for Model 4.

NE label      Precision  Recall  F-score
Location      0.88       0.84    0.86
Person        0.90       0.87    0.89
Organization  0.86       0.83    0.79

Experiment 2: NE Search and Classification Using Word2vec Semantic Similarity Vectors. Given a word or a phrase, word2vec is capable of retrieving linguistic units that stand in some semantic relation to the given one: synonyms, items of the same paradigmatic class, associations. But what can be found in the semantic similarity space of a named entity? In this experiment it is assumed that among the words and phrases whose vectors have high cosine similarity with the vector of a named entity, equivalent names of that named entity can be found. This turned out to be true for 48% of organizations, 50% of locations and 57% of person names (according to model 4). In 30% of cases, more than 3 equivalent names are found among the first 10 responses to the NE stimulus. Quality was evaluated as the percentage of true positives, given in Table 6.

Table 6: Unlabelled NE Prediction Accuracy on Distributed Representations (% of true positives).

Model     Location  Person  Organization
Model 1   81.29     54.03   64.61
Model 2   62.17     52.13   46.90
Model 3   55.43     57.84   46.26
Model 4   67.63     68.21   49.45

The overall quality is not high; still, it is possible to find and predict the class of unlabelled named entities whose vectors have high cosine similarity with the vector of a labelled NE. Trained model 1 produces high scores due to evaluation limitations: in models 1 and 2 only unigrams are considered. Subsequent comparison of trained models 2–4 confirms that quality improves when noun phrases are predicted. Below some examples are provided; only English translations are given. The NE stimulus is the first item in each list, given in italics; the remaining items are responses. Equivalents (which can be paraphrases or alternative names) are given in bold.

• The Prosecutor General: ATTY GEN, ATTY GEN of Russia, RF ATTY GEN, Deputy Prosecutor General, RF Prosecutor General, RF Prosecutor, General Prosecutor Office, Prosecutor General of Russia, Prosecutor General of Ukraine, Prosecutor General of Moscow

• Latin America: Latin, South America, Countries of Latin America, Latin American countries, South-East Asia, Countries of South America, China, Country, Eastern Europe

In most cases the list of responses contains individuals of the same class as the stimulus: e.g. given the name of a region of Russia, the model will return a list of other Russian regions. Among the NE candidates for city stimuli, wrongly lemmatized city names and toponym misspellings were found, which can also be used to detect lemmatization or spelling mistakes.
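For completeness, the Table 6 metric can be approximated in a few lines. In the paper the true/false tagging of candidates was done manually, so the `label_of` lookup below is a hypothetical stand-in for that manual step, and a gensim `KeyedVectors` model is assumed; this is a sketch, not the authors' evaluation code.

```python
from gensim.models import KeyedVectors

def same_label_share(wv: KeyedVectors, stimuli, label_of, topn: int = 10) -> float:
    """Share of top-`topn` neighbours carrying the same NE label as the stimulus.

    `stimuli` is a list of NE strings (phrases underscore-joined); `label_of`
    maps a string to its NE label, or None if it is not a known named entity.
    """
    hits = total = 0
    for stimulus in stimuli:
        if stimulus not in wv:  # skip out-of-vocabulary stimuli
            continue
        for candidate, _score in wv.most_similar(stimulus, topn=topn):
            total += 1
            if label_of(candidate) == label_of(stimulus):
                hits += 1
    return hits / total if total else 0.0
```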
7 Future Work

Future work implies the development of a stable and comprehensive model of distributed noun phrase representations that will extend existing resources for the Russian language. The admissible results on NE prediction using word2vec response lists allow us to continue with experiments on NE recognition from noisy texts and spoken language. The ability of distributed word representations to capture paraphrases and lexical variants of named entities can be used in algorithms for paraphrase search and for clustering similar entities and events.

8 Acknowledgements

This work was partially financially supported by the Government of the Russian Federation, Grant 074-U01.

References

Hakan Demir and Arzucan Ozgur. 2014. Improving Named Entity Recognition for Morphologically Rich Languages Using Word Embeddings. In 13th International Conference on Machine Learning and Applications, ICMLA 2014, Detroit, MI, USA, December 3–6, 2014, pages 117–122.

Frederic Godin, Baptist Vandersmissen, Wesley De Neve, and Rik Van de Walle. 2014. ACL W-NUT NER Shared Task: Named Entity Recognition for Twitter Microposts using Distributed Word Representations.

David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. 2006. A Closer Look at Skip-gram Modelling.

Evgeniy Kanevsky and Kirill Boyarsky. 2012. The Semantic-and-Syntactic Parser SEMSIN. In International Conference on Computational Linguistics Dialog-2012, Bekasovo, Russia.

Kezban Dilek Kisa and Pinar Karagoz. 2015. Named Entity Recognition from Scratch on Social Media. In Proceedings of the 6th International Workshop on Mining Ubiquitous and Social Environments (MUSE 2015), co-located with the 26th European Conference on Machine Learning / 19th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2015), Porto, Portugal, September 7, 2015, pages 2–17.

Konstantin Lopyrev. 2014. Learning Distributed Representations of Phrases.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, December 5–8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.

Isabel Segura-Bedmar, Víctor Suárez-Paniagua, and Paloma Martínez. 2015. Exploring Word Embedding for Drug Name Recognition. In 13th International Conference on Machine Learning and Applications, ICMLA 2014, Detroit, MI, USA, December 3–6, 2014, pages 117–122.

Miran Seok, Hye-Jeong Song, Chan-Young Park, Jong-Dae Kim, and Yu-seop Kim. 2016. Named Entity Recognition using Word Embedding as a Feature. International Journal of Software Engineering and Its Applications, 10(2).

Scharolta Katharina Siencnik. 2015. Adapting Word2vec to Named Entity Recognition. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 239–243.

A. Starostin, V. Bocharov, S. Alexeeva, and A. Bodrova. 2016. FactRuEval 2016: Evaluation of Named Entity Recognition and Fact Extraction Systems for Russian. In Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference "Dialogue 2016", Moscow, June 1–4, 2016.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word Representations: A Simple and General Method for Semi-Supervised Learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, Uppsala, Sweden, July. Association for Computational Linguistics.

Wenpeng Yin and Hinrich Schütze. 2014. An Exploration of Embeddings for Generalized Phrases. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22–27, 2014, Baltimore, MD, USA, Student Research Workshop, pages 41–47.

Mo Yu and Mark Dredze. 2015. Learning Composition Models for Phrase Embeddings. TACL, 3:227–242.