Recognizing Names in Islam-Related Russian Twitter

© V. Mozharova    © N. Loukachevitch
Lomonosov Moscow State University, Moscow, Russia
valerie.mozharova@gmail.com    louk_nat@mail.ru

Abstract. The paper describes an approach to creating a domain-specific tweet collection written by users frequently discussing Islam-related issues in Russian. We use this collection to study specific features of named entity recognition on Twitter. We found that, in contrast to tweets collected randomly, our tweet collection contains a relatively small number of spelling errors or strange word shortenings. Specific difficulties of our collection for named entity recognition include a large number of Arabic and other Eastern names and the frequent use of ALL-CAPS spelling for emphasizing the main words in messages. We studied the transfer of a NER model trained on a newswire collection to the created tweet collection and approaches to decrease the degradation of the model caused by the transfer. We found that for our specialized text collection, the largest improvement came from normalizing word capitalization. Two-stage approaches to named entity recognition and Word2vec-based clustering were also useful for our task.

Keywords: NER, CRF, Twitter.

Proceedings of the XIX International Conference "Data Analytics and Management in Data Intensive Domains" (DAMDID/RCDL'2017), Moscow, Russia, October 10–13, 2017

1 Introduction

Named entity recognition (NER) is one of the basic natural language processing tasks [17, 20]. Recognition of named entities in texts is used in many other information-processing tasks such as relation extraction, entity linking, information retrieval, etc. Most studies of NER have been carried out on news collections and have shown high quality of named entity extraction. However, the transfer of NER recognizers to other genres of text demonstrates a significant decrease in performance.

Currently, there is great interest in information extraction from texts published on social media platforms such as Twitter or Facebook because these platforms can serve as a very useful (fast and/or alternative) source of information [22]. But the application of general NER recognizers designed for or trained on news collections can demonstrate a decrease in performance of up to 50% or more on these informal texts [4, 7–9].

Another important direction of social network studies is devoted to differences of language and style in specific social media communities [11, 18] or their dependence on social and demographic characteristics of users [12, 21].

In this paper, we consider the transfer of a Russian NER recognizer trained on news texts to extracting names from Twitter messages. Our tweet collection is specialized; it is gathered from messages of those users who discuss issues related to Islam in their posts, in contrast to other studies where Twitter collections are formed by random sampling of Twitter messages. This allows us to reveal specific features of the tweet language of Islam-oriented and other similar communities. We consider the transfer of a CRF-based NER recognizer from news data to the tweet collection and approaches to decrease the degradation of the model caused by the transfer.

2 Related works

2.1 Named Entity Recognition for Twitter

It is known that extraction of names from Twitter messages is a much more difficult task than from other genres of text because of their shortness and informal character.

In [7] the authors review the problems and approaches to named entity recognition and entity linking for tweets. They write that tweet content is noisy because of incorrect spelling, irregular capitalization, and unusual abbreviations. In their experiments, the main sources of mistakes in named entity recognition in tweets were violations in capitalization, especially a large number of names written in lower case. They studied automatic normalization of tweets, including spelling and capitalization correction, and reported that in their investigation the normalization slightly improved the performance of NER for tweets.

In [24] the authors write that due to unreliable capitalization in tweets, common nouns are often misclassified as proper nouns, and vice versa. Some tweets contain all lowercase words (8%), whereas others are in ALL CAPS (0.6%). In addition to differences in vocabulary, the grammar of tweets differs from news text; for example, tweets often start with a verb. In their experiments, a supervised approach was used to predict the correct capitalization of words. The set of features included: the fraction of words in the tweet which are capitalized, the fraction which appear in a dictionary of frequently lowercase/capitalized words but are not lowercase/capitalized in the tweet, the number of times the word 'I' appears lowercase, and whether or not the first word in the tweet is capitalized.
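The tweet-level capitalization cues described in [24] can be sketched as follows (an illustrative reimplementation, not the authors' code; the dictionaries of frequently capitalized and frequently lowercase words are hypothetical inputs built from a reference corpus):

```python
def capitalization_features(tokens, cap_dict, lower_dict):
    """Tweet-level cues for predicting correct capitalization.

    cap_dict / lower_dict: lowercased words that are usually capitalized /
    usually lowercase in a large reference corpus (assumed inputs here).
    """
    n = len(tokens) or 1
    capitalized = [t for t in tokens if t[:1].isupper()]
    # words appearing in the opposite case from what the dictionaries expect
    unexpected_lower = [t for t in tokens
                        if t.lower() in cap_dict and not t[:1].isupper()]
    unexpected_caps = [t for t in tokens
                       if t.lower() in lower_dict and t[:1].isupper()]
    return {
        "frac_capitalized": len(capitalized) / n,
        "frac_unexpected_case": (len(unexpected_lower) + len(unexpected_caps)) / n,
        "lowercase_i_count": sum(1 for t in tokens if t == "i"),
        "first_word_capitalized": tokens[0][:1].isupper() if tokens else False,
    }
```

Such features feed a supervised classifier that decides whether a tweet's capitalization is reliable enough to be trusted by the NER model.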
To study how several NER systems perform on Twitter, [8] use crowdsourcing to annotate tweets for the NER task. They annotate all @user-names as PER (person name). Annotating tweets for their experiments, [24] choose not to annotate @usernames mentioned in tweets as entities because it is trivial to identify them using a simple regular expression, and they would only serve to inflate the performance statistics.

In [4] the authors study the transfer of their NER model from news texts to tweets. They create a training set consisting of 1000 tweets. They use a baseline NER model based on token and context features (wordform, lemma, capitalization, prefixes and suffixes) and enhance it with two unsupervised representations (Brown clusters and vector representations) based on a large collection of unannotated tweets. Besides, they propose a technique to combine a relatively small Twitter training set and larger newswire training data. They report that the two unsupervised representations work better together than alone, and that the combination of training sets further improves the performance of their NER system.

2.2 Named Entity Recognition in Russian

In Russian there is a long tradition of engineering approaches to the named entity recognition task [13, 14, 23].

Machine-learning approaches for Russian NER usually employ the CRF machine learning method. In [1] the authors presented the results of a CRF-based method on various tasks, including named entity recognition. The experiments were carried out on their own Russian text corpus, which contained 71,000 sentences. They used only n-grams and orthographic features of tokens without utilizing any knowledge-based features. They achieved 89.89% F-score on three named entity types: names (93.15%), geographical objects (92.7%), and organizations (83.83%).

In [19] the experiments utilized the open Russian text collection "Persons-600"¹ for the person name recognition task. The CRF-based classifier employed such features as token features, context features, and features based on knowledge about persons (roles, professions, posts, and others). They achieved 88.32% F-score on person names.

In [10] the experiments were carried out on a Russian text collection that contained 97 documents. The authors used two approaches for named entity recognition: a knowledge-based and a CRF-based approach. In the machine learning framework they utilized such features as token features and knowledge features based on word clustering (LDA topics [17], Brown clusters [3], Clark clusters [6]). They achieved 75.05% F-score on two named entity types: persons (84.84%) and organizations (71.31%).

In 2016 the FactRuEval competition for the Russian language was organized. The FactRuEval tasks included recognition of names in Russian news texts, recognition of specific attributes of names (family name, first name, etc.), and extraction of several types of facts [2].

So far, named entity recognition in tweets has not been studied for Russian. Also, the dependence of NER performance on the language of specific Twitter user communities has not been studied before.
3 Text collections

3.1 News Text Collection

We study the transfer of a CRF-based NER classifier trained on newswire data to the tweet collection. For training our system, we chose the open Russian text collection "Persons-1000", which contains 1000 news documents labeled with three types of named entities: persons, organizations and locations². The labeling rules are detailed in [16]. The counts of each named entity type in the collection are listed in Table 1.

Table 1 The quantitative characteristics of the labeled named entities in text collections

Type    | News collection | Twitter collection
PER     | 10623           | 1546
ORG     | 8541            | 1144
LOC     | 7244            | 2836
OVERALL | 26408           | 5526

3.2 Tweet Text Collection

We are interested in studying the language of Islam-related Twitter users in Russian. To extract tweets from users discussing Islam-related issues, we created a list of 2700 Islam terms. Then we extracted Russian tweets mentioning these terms using the Twitter Search API, got the user accounts containing the extracted tweets, and ordered the accounts by the decreasing number of extracted tweets from these accounts. We found that a lot of words from our list are practically not mentioned in tweets, while other words (for example, "mosque" or "Muslim") are often used by very different people, not only Muslims.

After studying tweets from the extracted accounts, we created a very small list of the main Islam words ("Allah", "Quran", "Prophet", in various forms of Russian morphology). We also added the names of several known Islamist organizations to find their possible non-Muslim proponents. Then we repeated the whole procedure of tweet and account extraction and found that the extracted collections can be considered an appropriate approximation of messages generated by Islam-related users.

¹ http://ai-center.botik.ru/Airec/index.php?option=com_content&view=article&id=27:persons-600&catid=15&Itemid=40
² http://labinform.ru/pub/named_entities/descr_ne.htm
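The account-selection step above can be sketched as follows (an illustrative sketch; the API wrapper producing the term-matching tweets is assumed and not shown):

```python
from collections import Counter

def rank_accounts(matched_tweets):
    """Order accounts by the decreasing number of term-matching tweets.

    matched_tweets: iterable of (account, text) pairs that already matched
    the Islam-term list (the Twitter Search API wrapper is assumed).
    """
    counts = Counter(account for account, _ in matched_tweets)
    return [account for account, _ in counts.most_common()]
```

The top-ranked accounts are then inspected manually, which is how the noisy terms ("mosque", "Muslim") were discovered and the term list reduced.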
We selected 100 users with the largest number of extracted tweets, downloaded all their tweets and obtained a tweet collection consisting of 300 thousand tweets (further FullTweetCollection). Then we randomly extracted tweets from different users, removed non-Russian or senseless tweets, and at last obtained a tweet collection of 4192 tweets (further TestTweetCollection). The created collection contains messages with Quran quotes, religious and political argumentation, news-related messages mainly about Near and Middle East events (Syria, Iraq, Afghanistan, etc.) and Islamist organizations (Syrian opposition groups, ISIL, etc.), and also other types of messages (for example, advertisements).

The obtained collection was labeled similarly to "Persons-1000". To annotate the numerous mentions of Allah, we added the Deity type to the annotation scheme, but in the current study we consider the Deity type as a subtype of the Person type.

Analyzing the created collection from the point of view of NER difficulties, we found that violations in capitalization mainly include all-caps words for the whole tweet or its fragment. Such capitalization is used for emphasizing important words in the text or words related to Allah, as in the following example: "За все потери ОН дает нам большую награду" ("For all the losses He gives us a great reward"). Also the tweets mention a lot of Eastern names of persons and organizations ("Фастаким Кама умирт" (Fastakim Kama Umirt group name), "Джабхат Фатх аш-Шам" (Jabhat Fateh al-Sham)), or local places difficult for correct recognition.

The fraction of tweets with spelling mistakes and unusual shortenings is relatively low. We suppose that this is because the selected users are well-educated; they are professional writers in some sense, and in most cases they are opinion leaders whose messages are retweeted by many other people. Therefore it is especially useful to study the specific features of their tweet language.

4 Description of NER Model

In our study, we employ a baseline CRF-classifier that utilizes token features, context features, and lexicon features for NER. Then we consider ways to improve the baseline model by adapting it to the Twitter language. The adaptation techniques include the use of two-stage processing and unsupervised word clustering. Besides, we test the impact of tweet normalization on the NER performance.

4.1 Baseline model

Before named entity recognition with CRF, tweets are processed with a morphological analyzer that determines the part of speech, gender, lemma, grammatical number, case and other characteristics of words. This information is used to form the features of each word for classification. In the baseline model we consider three types of features: local token features, context features, and features based on lexicons.

4.1.1 Token features

The token features include:
• Token initial form (lemma);
• Number of symbols in a token;
• Letter case. If a token begins with a capital letter and the other letters are small, then the value of this feature is "Big Small". If all letters are capital, then the value is "Big Big". If all letters are small, then the value is "Small Small". In other cases the value is "Fence";
• Token type. The value of this feature for lexemes is the part of speech; for punctuation marks the value is the type of punctuation;
• Symbol n-grams. The presence of prefixes, suffixes and other n-grams from the predefined sets in a token.

4.1.2 Context-based features

The group of context features includes two feature types. The first type is local context features. It takes into account all the mentioned token feature values of nearby words in a two-word window to the right and to the left of the current word.

The second type is the bigram context feature. It contains information about the determined named entity type of the previous word. It helps to find named entity borders more precisely. For example, if a person's second name is difficult for recognition, the presence of the first name before this word makes the classification easier.

4.1.3 Features based on lexicons

To improve the quality of recognition, we added special lexicons with lists of useful objects. An object can be a word or a phrase. The lexicons had been created before the current work and were not changed during the study.

To calculate the lexicon features, the system matches the text against lexicon entries. If a token is met in a matched lexicon entry, then it obtains a lexicon feature value equal to the length of the found entry. The use of the entry length as a feature helps to diminish the effect of lexical ambiguity. For example, in the list of organizations there is "Apple" as the name of a company. But this word does not necessarily mean a company because it has the second sense of a fruit. In contrast, if we find in the text the phrase "Lomonosov Moscow State University", which is also included in the organization lexicon, the probability of the organization sense is higher than in the first case. The lexicon feature containing the matched entry length helps the system to distinguish these two cases.

The biggest lexicons are listed in Table 2. The overall size of all vocabularies is more than 335 thousand entries.
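Two of the features above can be sketched as follows (an illustrative sketch assuming whitespace-tokenized input, not the authors' implementation):

```python
def letter_case_feature(token):
    """Letter-case feature with the four values described above:
    "Big Small", "Big Big", "Small Small" or "Fence"."""
    letters = [ch for ch in token if ch.isalpha()]
    if letters:
        if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
            return "Big Small"
        if all(ch.isupper() for ch in letters):
            return "Big Big"
        if all(ch.islower() for ch in letters):
            return "Small Small"
    return "Fence"

def lexicon_length_features(tokens, lexicon):
    """Lexicon feature: each token gets the length (in words) of the longest
    matched lexicon entry covering it, or 0 if no entry matches.

    lexicon: a set of entries, each a tuple of lowercased words.
    """
    values = [0] * len(tokens)
    lowered = [t.lower() for t in tokens]
    max_len = max((len(entry) for entry in lexicon), default=0)
    for start in range(len(tokens)):
        for length in range(min(max_len, len(tokens) - start), 0, -1):
            if tuple(lowered[start:start + length]) in lexicon:
                for i in range(start, start + length):
                    values[i] = max(values[i], length)
                break
    return values
```

With the "Apple" example above, a lone "Apple" receives value 1, while each word of "Lomonosov Moscow State University" receives value 4, letting the CRF weight long, unambiguous matches more strongly.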
These lexicons were collected from several sources: phonebooks, Russian Wikipedia, and the RuThes thesaurus [15].

Table 2 Vocabulary sizes

Vocabulary              | Size, objects | Clarification
Famous persons          | 31482 | Famous people
First names             | 2773  | First names
Surnames                | 66108 | Surnames
Person roles            | 9935  | Roles, posts
Verbs of informing      | 1729  | Verbs that usually occur with persons
Companies               | 33380 | Organization names
Company types           | 6774  | Organization types
Media                   | 3909  | Media
Geography               | 8969  | Geographical objects
Geographical adjectives | 1739  | Geographical adjectives
Usual words             | 58432 | Frequent Russian words (nouns, verbs, adjectives)
Equipment               | 44094 | Devices, equipment, tools

4.2 Adaptation of NER Model to Tweets

4.2.1 Unsupervised word clustering

In previous studies it was shown that unsupervised word clustering on the basis of a large text collection improves NER performance. In our case we compare the impact of word clusters calculated on a large news collection and on a large tweet collection. For clustering we use the Word2vec package¹. It represents words with distributional vectors (word embeddings) computed with a neural network. The semantic similarity between two words is calculated as the cosine measure between the two corresponding vectors. The package allows automatic clustering of words according to their calculated similarity. We used the c-bow model with vector size equal to 300. Thus, each word has an additional feature – the number of the cluster in which it appears. The news collection utilized for clustering contains two million news documents. For tweet-based clustering we use a tweet collection consisting of randomly extracted Russian tweets and including 8.3 million tweets.

¹ https://github.com/dav/word2vec
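The cluster-number feature can be illustrated as follows. This is a self-contained sketch: it assumes pre-trained word2vec-style embeddings are already available as a dict and clusters them with a small k-means, whereas the paper used the Word2vec package's own clustering over CBOW vectors of size 300:

```python
import numpy as np

def cluster_feature(embeddings, k, n_iter=20, seed=0):
    """Map each word to a cluster number via k-means over its embedding.

    embeddings: dict word -> 1-D numpy vector (pre-trained elsewhere;
    training the vectors is out of scope for this sketch).
    """
    words = sorted(embeddings)
    X = np.stack([embeddings[w] for w in words])
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each vector to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # recompute centers, keeping the old center for empty clusters
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return {w: int(labels[i]) for i, w in enumerate(words)}
```

The resulting cluster id is added as one more categorical feature per token, so that rare words share evidence with distributionally similar frequent words.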
4.2.2 Two-stage prediction

We suppose that for adapting a classifier to a text collection it can be useful to take into account the entities already labeled by the classifier and to memorize the named entity type statistics for future use.

At the first stage the classifier extracts named entities. Then the system collects the class statistics determined at the first stage for each word and uses them as features for the second stage. After that, the new features together with the old ones participate in the final classification. These statistics can be collected from the current text (the whole text or the part preceding the analyzed word) or from a large text collection (collection statistics). In the case of tweet processing, texts are small; therefore only the collection statistics can be used. In our experiments these statistics can be obtained from the FullTweetCollection gathered from the selected user accounts or from the labeled TestTweetCollection, as described in Section 3.2.

For each word, the system finds all mentions of this word in the processed collection and counts the frequencies of the named entity types determined for this word. Using these frequencies, for each entity type the system creates additional features, which have one of three values: no_one (if the word has not been recognized as a named entity of the chosen type), best (if the word has been assigned to the chosen named entity type in more than 50% of cases), and rare (if the word has been assigned to the chosen named entity type in less than 50% of cases).

For example, if the word "Russia" was met 500 times in a collection, and the classifier assigned it 200 times to organizations and 300 times to locations, then the values of the global statistics feature for the word "Russia" will be as follows: PER – no_one, ORG – rare, LOC – best.
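The collection-statistics feature can be sketched as follows (an illustrative sketch of the no_one/best/rare scheme described above, not the authors' code):

```python
from collections import Counter

def collection_statistics(first_stage_labels, types=("PER", "ORG", "LOC")):
    """Turn first-stage predictions into no_one/best/rare features.

    first_stage_labels: iterable of (word, entity_type_or_None) pairs
    produced by the first CRF pass over the whole collection.
    """
    total = Counter()
    per_type = Counter()
    for word, etype in first_stage_labels:
        total[word] += 1
        if etype is not None:
            per_type[(word, etype)] += 1
    features = {}
    for word in total:
        feats = {}
        for t in types:
            share = per_type[(word, t)] / total[word]
            feats[t] = "no_one" if share == 0 else "best" if share > 0.5 else "rare"
        features[word] = feats
    return features
```

The "Russia" example from the text works out as expected: 200 of 500 assignments to ORG gives rare, 300 of 500 to LOC gives best, and no PER assignments give no_one.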
4.3 Normalization of Word Capitalization

As we found that in our tweet collection the share of misprints is not very high, we performed normalization only for word capitalization. The normalization was based on the large news collection described in Section 4.2.1. For each word in this collection, we counted how many times the word was written in lowercase or capitalized when it did not stand at the beginning of a sentence. The more frequent case was considered as normal for this word. We considered the normalization in two variants:
• Variant A. All words in a tweet, except the first one, are changed to the normal form of capitalization;
• Variant B. All words in a tweet, including the first one, are changed to the normal form of capitalization.

Figure 1 Tweets before normalization

We found that variant B produces better results and later experimented only with this variant. Fig. 1 presents several tweets with the manual annotation before normalization. Fig. 2 shows the same tweets after normalization.

Figure 2 Tweets after normalization

Also the hashtag symbol was removed from a word if this word was found in the news collection, to improve its matching with the lexicons.
5 Experiments

In preprocessing we remove the mentioned user accounts with "@" and the URLs at the end of tweets. We consider these data as additional meta-information, from which we should not extract names.
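This preprocessing step can be sketched with regular expressions (an illustrative sketch; the patterns are assumptions, not the authors' exact expressions):

```python
import re

MENTION = re.compile(r"@\w+")                      # @user account mentions
TRAILING_URL = re.compile(r"(?:\s*https?://\S+)+\s*$")  # URLs at the tweet's end

def preprocess(tweet):
    """Strip @user mentions and trailing URLs before NER."""
    tweet = TRAILING_URL.sub("", tweet)
    tweet = MENTION.sub("", tweet)
    return " ".join(tweet.split())
```

Removing rather than annotating @usernames follows the decision discussed in Section 2.1: they are trivially identifiable and would only inflate the scores.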
We train the described variants of our NER model on the news collection. Table 3 shows the results of named entity recognition on the "Persons-1000" collection (cross-validation 3:4). It can be seen that our baseline model is quite good on the news collection and is slightly improved after adding the clustering features and the two-stage approach. In this case the collection statistics are obtained from the same "Persons-1000" collection.

Table 3 News Collection NER Performance

Model                                            | F-measure, %
Baseline                                         | 92.49
Baseline + News clusters                         | 93.48
Baseline + News clusters + Collection statistics | 93.53

Then we apply the trained model to the test tweet collection with the initial capitalization and with normalized capitalization. Table 4 presents the performance of the NER models trained on the "Persons-1000" collection on the tweet data. One can see that all models significantly degrade on the tweet collection.

The normalization significantly improves the performance of NER (in contrast to other studies [7]). Word clustering and the collection statistics improve NER for both the initial and the normalized text collections. Their impact is larger than for the news collection (Table 3). The combination of tweet and news clusters was better than tweet clusters alone, possibly because of the political and religious character of the gathered collection. In total, the NER performance improves by more than 10% on the tweet data.

Analyzing the mistakes of the best model on the normalized collection, we can still see a significant share of mistakes caused by incorrectly normalized capitalization. We can enumerate the following main subtypes of such problems:
• ambiguous words with different capitalization ("Earth", "Rose");
• words that should be capitalized in this specific collection; for example, "Paradise" and "Hell" seem to be specific entities in this genre of texts;
• multiword expressions in which each word is usually written in lowercase, but together the multiword expression denotes a name and at least the first word should be capitalized. For example, the expression "Московский регион" (Moscow region) is normalized incorrectly because the word "московский" is written in lowercase more frequently in the Russian news collection.

Table 4 Tweet Performance

Model                                                     | F-measure, TestTweetCollection | F-measure, Normalized TestTweetCollection
1) Baseline                                               | 64.44% | 69.88%
2) Baseline + Collection statistics (TestTweetCollection) | 64.99% | 70.32%
3) Baseline + Collection statistics (FullTweetCollection) | 65.78% | 70.44%
4) Baseline + news clusters                               | 66.03% | 70.88%
5) Baseline + tweet clusters                              | 66.08% | 70.36%
6) Baseline + tweet and news clusters                     | 66.23% | 70.89%
7) (2) + tweet and news clusters                          | 67.27% | 71.20%
8) (3) + tweet and news clusters                          | 66.46% | 69.73%

6 Conclusion

The paper describes an approach to creating a domain-specific tweet collection written by users frequently discussing Islam-related issues in Russian. We use this collection to study specific features of named entity recognition on Twitter. We found that, in contrast to tweets collected randomly, our tweet collection contains a relatively small number of spelling errors or strange word shortenings. Specific difficulties of our collection for named entity recognition include a large number of Arabic and other Eastern names (persons, locations, organizations) and the frequent use of ALL-CAPS writing for emphasizing the main words in messages.

We have studied the transfer of a NER model trained on a newswire collection to the created tweet collection and approaches to decrease the degradation of the model caused by the transfer. We found that for our specialized text collection, the largest improvement came from normalizing word capitalization. Two-stage approaches to named entity recognition and word2vec-based clustering were also useful for our task.

In future we plan to improve the techniques of tweet normalization and to study NER for tweets of followers of the selected users.

References

[1] Antonova, A.Y., Soloviev, A.N.: Conditional Random Field Models for the Processing of Russian. In: Int. Conf. "Dialog 2013", pp. 27-44. RGGU (2013)
[2] Bocharov, V.V. et al.: FactRuEval 2016: Evaluation of Named Entity Recognition and Fact Extraction Systems for Russian. In: Dialog Conference (2016)
[3] Brown, P.F., Della Pietra, V.J., Desouza, P.V., Lai, J.C., Mercer, R.L.: Class-based n-gram Models of Natural Language. Computational Linguistics, 18 (4), pp. 467-479 (1992)
[4] Cherry, C., Guo, H.: The Unreasonable Effectiveness of Word Representations for Twitter Named Entity Recognition. In: NAACL-2015, pp. 735-745 (2015)
[5] Chrupala, G.: Efficient Induction of Probabilistic Word Classes with LDA. In: 5th Int. Joint Conf. on Natural Language Processing, IJCNLP 2011, pp. 363-372. Asian Federation of Natural Language Processing (2011)
[6] Clark, A.: Combining Distributional and Morphological Information for Part of Speech Induction. In: 10th Conf. of the European Chapter of the Association for Computational Linguistics, EACL, 1, pp. 59-66. ACL (2003)
[7] Derczynski, L., Maynard, D., Rizzo, G., van Erp, M., Gorrell, G., Troncy, R., Bontcheva, K.: Analysis of Named Entity Recognition and Linking for Tweets. Information Processing & Management, 51 (2), pp. 32-49 (2015)
[8] Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., Dredze, M.: Annotating Named Entities in Twitter Data with Crowdsourcing. In: NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 80-88 (2010)
[9] Fromreide, H., Hovy, D., Sogaard, A.: Crowdsourcing and Annotating NER for Twitter #drift. In: LREC-2014, pp. 2544-2547 (2014)
[10] Gareev, R., Tkachenko, M., Solovyev, V., Simanovsky, A., Ivanov, V.: Introducing Baselines for Russian Named Entity Recognition. In: 14th Int. Conf. CICLing 2013, pp. 329-342. Springer (2013)
[11] Hidayatullah, A.F.: Language Tweet Characteristics of Indonesian Citizens. In: Int. Conf. IEEE-2015, pp. 397-401 (2015)
[12] Hovy, D.: Demographic Factors Improve Classification Performance. In: ACL-2015, pp. 752-762 (2015)
[13] Khoroshevsky, V.F.: Ontology Driven Multilingual Information Extraction and Intelligent Analytics. In: Web Intelligence and Security, pp. 237-262 (2010)
[14] Kuznetsov, I.P., Kozerenko, E.B., Kuznetsov, K.I., Timonina, N.O.: Intelligent System for Entities Extraction (ISEE) from Natural Language Texts. In: Int. Workshop on Conceptual Structures for Extracting Natural Language Semantics-Sense, (9), pp. 17-25 (2009)
[15] Nadeau, D., Sekine, S.: A Survey of Named Entity Recognition and Classification. Lingvisticae Investigationes, 30 (1), pp. 3-26 (2007)
[16] Paris, C., Thomas, P., Wan, S.: Differences in Language and Style Between Two Social Media Communities. In: 6th AAAI Int. Conf. on Weblogs and Social Media, ICWSM (2012)
[17] Podobryaev, A.V.: Persons Recognition Using CRF Model. In: 15th All-Russian Scientific Conf. "Digital Libraries: Advanced Methods and Technologies, Digital Collections", RCDL-2013, pp. 255-258. Demidov Yaroslavl State University (2013)
[18] Ratinov, L., Roth, D.: Design Challenges and Misconceptions in Named Entity Recognition. In: 13th Conf. on Computational Natural Language Learning, CoNLL, pp. 147-155. ACL (2009)
[19] Ritter, A., Clark, S., Mausam, Etzioni, O.: Named Entity Recognition in Tweets: An Experimental Study. In: EMNLP, pp. 1524-1534 (2011)
[20] Ritter, A., Etzioni, O., Clark, S. et al.: Open Domain Event Extraction from Twitter. In: Conf. on Knowledge Discovery and Data Mining, KDD, pp. 1104-1112 (2012)
[21] Cherry, C., Guo, H.: The Unreasonable Effectiveness of Word Representations for Twitter Named Entity Recognition. In: NAACL-2015, pp. 735-745 (2015)
[22] Trofimov, I.V.: Person Name Recognition in News Articles Based on the Persons-1000/1111-F Collections. In: 16th All-Russian Scientific Conf. "Digital Libraries: Advanced Methods and Technologies, Digital Collections", RCDL 2014, pp. 217-221 (2014)
[23] Yang, Y., Eisenstein, J.: Putting Things in Context: Community-specific Embedding Projections for Sentiment Analysis. arXiv preprint arXiv:1511.06052 (2015)