=Paper=
{{Paper
|id=Vol-2022/paper40
|storemode=property
|title=Recognizing Names in Islam-Related Russian Twitter
|pdfUrl=https://ceur-ws.org/Vol-2022/paper40.pdf
|volume=Vol-2022
|authors=Valerie Mozharova,Natalia Loukachevitch
|dblpUrl=https://dblp.org/rec/conf/rcdl/MozharovaL17
}}
==
Recognizing Names in Islam-Related Russian Twitter
==
© V. Mozharova © N. Loukachevitch
Lomonosov Moscow State University,
Moscow, Russia
valerie.mozharova@gmail.com louk_nat@mail.ru
Abstract. The paper describes an approach to creating a domain-specific tweet collection written by users frequently discussing Islam-related issues in Russian. We use this collection to study specific features of named entity recognition on Twitter. We found that, in contrast to tweets collected randomly, our tweet collection contains a relatively small number of spelling errors or strange word shortenings. Specific difficulties of our collection for named entity recognition include a large number of Arabic and other Eastern names and frequent use of ALL-CAPS spelling for emphasizing main words in messages. We studied the transfer of a NER model trained on a newswire collection to the created tweet collection and approaches to decrease the degradation of the model caused by the transfer. We found that for our specialized text collection, the largest improvement came from normalizing word capitalization. Two-stage approaches to named entity recognition and Word2vec-based clustering were also useful for our task.
Keywords: NER, CRF, Twitter.
1 Introduction

Named entity recognition (NER) is one of the basic natural language processing tasks [17, 20]. Recognition of named entities in texts is used in many other information-processing tasks such as relation extraction, entity linking, information retrieval, etc. Most studies of NER have been carried out on news collections and have shown high quality of named entity extraction. However, the transfer of NER recognizers to other genres of texts demonstrates a significant decrease in performance.

Currently, there is great interest in information extraction from texts published on social media platforms such as Twitter or Facebook because these platforms can serve as a very useful (fast and/or alternative) source of information [22]. But the application of general NER recognizers designed for or trained on news collections can show a decrease in performance of up to 50% or more on these informal texts [4, 7–9].

Another important direction of social network studies addresses differences of language and style in specific social media communities [11, 18] or their dependence on the social and demographic characteristics of users [12, 21].

In this paper, we consider the transfer of a Russian NER recognizer trained on news texts to extracting names from Twitter messages. Our tweet collection is specialized: it is gathered from messages of those users who discuss issues related to Islam in their posts, in contrast to other studies where Twitter collections are formed by random sampling of Twitter messages. This allows us to reveal specific features of the tweet language of Islam-oriented and other similar communities. We consider the transfer of a CRF-based NER recognizer from news data to the tweet collection and approaches to decrease the degradation of the model because of the transfer.

2 Related works

2.1 Named Entity Recognition for Twitter

It is known that extraction of names from Twitter messages is a much more difficult task than from other genres of text because of their shortness and informal character.

In [7] the authors review the problems and approaches to named entity recognition and entity linking for tweets. They write that tweet content is noisy because of incorrect spelling, irregular capitalization, and unusual abbreviations. In their experiments, the main sources of mistakes in named entity recognition in tweets were violations in capitalization, especially a large number of names written in lower case. They studied automatic normalization of tweets, including spelling and capitalization correction, and reported that in their investigation the normalization slightly improved NER performance for tweets.

In [24] the authors write that due to unreliable capitalization in tweets, common nouns are often misclassified as proper nouns, and vice versa. Some tweets contain all lowercase words (8%), whereas others are in ALL CAPS (0.6%). In addition to differences in vocabulary, the grammar of tweets differs from news text; for example, tweets often start with a verb. In their experiments, a supervised approach was used to predict the correct capitalization of words. The set of features included: the fraction of words in the tweet which are capitalized, the fraction which appear in a dictionary of frequently lowercase/capitalized words but are not lowercase/capitalized in the tweet, the number of times the word 'I' appears lowercase, and whether or not the first word in the tweet is capitalized.

To study NER on Twitter performed with several NER systems, [8] use crowdsourcing to annotate tweets for the NER task. They annotate all @user-names as PER (person name). Annotating tweets for their experiments, [24] choose not to annotate @usernames mentioned in tweets as entities because it is trivial to identify them using a simple regular expression, and they would only serve to inflate the performance statistics.

In [4] the authors study the transfer of their NER model from news texts to tweets. They create a training set consisting of 1000 tweets. They use a baseline NER model based on token and context features (wordform, lemma, capitalization, prefixes and suffixes) and enhance it with two unsupervised representations (Brown clusters and vector representations) based on a large collection of unannotated tweets. Besides, they propose a technique to combine a relatively small Twitter training set and larger newswire training data. They report that the two unsupervised representations work better together than alone, and that the combination of training sets further improves the performance of their NER system.

2.2 Named Entity Recognition in Russian

In Russian there is a long tradition of engineering approaches to the named entity recognition task [13, 14, 23].

Machine-learning approaches for Russian NER usually employ the CRF machine learning method. In [1] the authors presented the results of a CRF-based method on various tasks, including named entity recognition. The experiments were carried out on their own Russian text corpus, which contained 71,000 sentences. They used only n-grams and orthographic features of tokens without utilizing any knowledge-based features. They achieved an F-score of 89.89% on three named entity types: names (93.15%), geographical objects (92.7%), and organizations (83.83%).

In [19] the experiments utilized the open Russian text collection "Persons-600" (http://ai-center.botik.ru/Airec/index.php?option=com_content&view=article&id=27:persons-600&catid=15&Itemid=40) for the person name recognition task. The CRF-based classifier employed such features as token features, context features, and features based on knowledge about persons (roles, professions, posts, and others). They achieved an F-score of 88.32% on person names.

In [10] the experiments were carried out on a Russian text collection that contained 97 documents. The authors used two approaches for named entity recognition: a knowledge-based and a CRF-based approach. In the machine learning framework they utilized such features as token features and knowledge features based on word clustering (LDA topics [17], Brown clusters [3], Clark clusters [6]). They achieved an F-score of 75.05% on two named entity types: persons (84.84%) and organizations (71.31%).

In 2016 the FactRuEval competition for the Russian language was organized. The FactRuEval tasks included recognition of names in Russian news texts, recognition of specific attributes of names (family name, first name, etc.), and extraction of several types of facts [2].

So far, named entity recognition in tweets has not been studied for Russian. Also, the dependence of NER performance on the language of specific Twitter user communities has not been studied before.

3 Text collections

3.1 News Text Collection

We study the transfer of a CRF-based NER classifier trained on newswire data to the tweet collection. For training our system, we chose the open Russian text collection "Persons-1000", which contains 1000 news documents labeled with three types of named entities: persons, organizations and locations (http://labinform.ru/pub/named_entities/descr_ne.htm). The labeling rules are detailed in [16]. The counts of each named entity type in the collection are listed in Table 1.

Table 1 The quantitative characteristics of the labeled named entities in text collections

Type      News collection   Twitter collection
PER       10623             1546
ORG       8541              1144
LOC       7244              2836
OVERALL   26408             5526

3.2 Tweet Text Collection

We are interested in studying the language of Islam-related Twitter users in Russian. To extract tweets from users discussing Islam-related issues, we created a list of 2700 Islam terms. Then we extracted Russian tweets mentioning these terms using the Twitter Search API, obtained the users' accounts containing the extracted tweets, and ordered the accounts by the decreasing number of extracted tweets from these accounts. We found that a lot of words from our list are practically not mentioned in tweets, while other words (for example, "mosque" or "Muslim") are often used by very different people, not only Muslims.

After studying tweets from the extracted accounts, we created a very small list of the main Islam words ("Allah", "Quran", "Prophet", in various forms of Russian morphology). We also added the names of several known Islamist organizations to find their possible non-Muslim proponents. Then we repeated the whole procedure of tweet and account extraction and found that the extracted collections can be considered an appropriate approximation of messages generated by Islam-related users.
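To make the account-ranking step concrete, here is a small Python sketch of it; the search_russian_tweets helper (returning (account, tweet) pairs for a query term) and the cut-off of 100 accounts are hypothetical stand-ins for the Twitter Search API calls described above, not the authors' code.

from collections import Counter

def rank_accounts(islam_terms, search_russian_tweets, top_n=100):
    """Order accounts by the decreasing number of their tweets that mention the terms.

    search_russian_tweets(term) is a hypothetical helper that queries the
    Twitter Search API for Russian tweets containing `term` and yields
    (account_id, tweet_text) pairs.
    """
    tweets_per_account = Counter()
    for term in islam_terms:
        for account_id, _text in search_russian_tweets(term):
            tweets_per_account[account_id] += 1
    # Keep the accounts with the largest number of matching tweets.
    return [account for account, _n in tweets_per_account.most_common(top_n)]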
We selected 100 users with the largest number of extracted tweets, downloaded all their tweets and obtained a tweet collection consisting of 300 thousand tweets (further FullTweetCollection). Then we randomly extracted tweets from different users, removed non-Russian or meaningless tweets and finally obtained a tweet collection of 4192 tweets (further TestTweetCollection).

The created collection contains messages with Quran quotes, religious and political argumentation, news-related messages mainly about Near and Middle East events (Syria, Iraq, Afghanistan, etc.) and Islamist organizations (Syrian opposition groups, ISIL, etc.), and also other types of messages (for example, advertisements).

The obtained collection was labeled similarly to "Persons-1000". To annotate the numerous mentions of Allah, we added the Deity type to the annotation scheme, but in the current study we consider the Deity type as a subtype of the Person type.

Analyzing the created collection from the point of view of NER difficulties, we found that violations in capitalization mainly include all-caps spelling of the whole tweet or of its fragment. Such capitalization is used for emphasizing important words in the text or words related to Allah, as in the following example: "За все потери ОН дает нам большую награду" ("For all the losses He gives us a great reward"). Also the tweets mention a lot of Eastern names of persons, organizations ("Фастаким Кама умирт" (Fastakim Kama Umirt group), "Джабхат Фатх аш-Шам" (Jabhat Fateh al-Sham)), or local places difficult for correct recognition.

The fraction of tweets with spelling mistakes and unusual shortenings is relatively low. We suppose that this is because the selected users are well educated; they are professional writers in some sense, and in most cases they are opinion leaders whose messages are retweeted by many other people. Therefore it is especially useful to study the specific features of their tweet language.

4 Description of NER Model

In our study, we employ a baseline CRF classifier that utilizes token features, context features, and lexicon features for NER. Then we consider ways to improve the baseline model by adapting it to the Twitter language. The adaptation techniques include the use of two-stage processing and unsupervised word clustering. Besides, we test the impact of tweet normalization on the NER performance.

4.1 Baseline model

Before named entity recognition with CRF, tweets are processed with a morphological analyzer that determines the part of speech, gender, lemma, grammatical number, case and other characteristics of words. This information is used to form the features of each word for classification. In the baseline model we consider three types of features: local token features, context features, and features based on lexicons.

4.1.1 Token features

The token features include:
• Token initial form (lemma);
• Number of symbols in a token;
• Letter case. If a token begins with a capital letter and the other letters are small, the value of this feature is "Big Small". If all letters are capital, the value is "Big Big". If all letters are small, the value is "Small Small". In other cases the value is "Fence" (a sketch of this feature is given after the list);
• Token type. The value of this feature for lexemes is the part of speech; for punctuation marks the value is the type of punctuation;
• Symbol n-grams. The presence of prefixes, suffixes and other n-grams from the predefined sets in a token.
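For illustration, a possible implementation of the letter-case feature; the function name and the handling of tokens without letters are our assumptions.

def letter_case_feature(token: str) -> str:
    """Value of the letter-case token feature as described above."""
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        return "Fence"
    first, rest = letters[0], letters[1:]
    if first.isupper() and all(ch.islower() for ch in rest):
        return "Big Small"    # capitalized word, e.g. "Moscow"
    if all(ch.isupper() for ch in letters):
        return "Big Big"      # ALL-CAPS word, e.g. "ISIL"
    if all(ch.islower() for ch in letters):
        return "Small Small"  # lowercase word
    return "Fence"            # mixed case, e.g. "iPhone"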
4.1.2 Context-based features

The group of context features includes two feature types. The first type is local context features: it takes into account all the above token feature values of nearby words in a two-word window to the right and to the left of the current word.

The second type is the bigram context feature. It contains information about the named entity type determined for the previous word. It helps to find named entity borders more precisely. For example, if a person's surname is difficult to recognize, the presence of the first name before this word makes the classification easier.
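To show how the token and local context features fit together, below is a sketch of a per-token feature dictionary with a two-word window, in the form typically fed to CRF toolkits such as CRFsuite; the feature names are assumptions, the paper does not name its CRF implementation, and the bigram label feature is usually handled by a linear-chain CRF itself.

def token_features(sent, i, lemmas, pos_tags):
    """Features of token i of a sentence plus the same features of its
    neighbours in a two-word window (the local context features)."""
    def base(j):
        tok = sent[j]
        return {
            "lemma": lemmas[j],
            "length": len(tok),
            "case": letter_case_feature(tok),  # defined in the Section 4.1.1 sketch
            "pos": pos_tags[j],
            "prefix3": tok[:3].lower(),
            "suffix3": tok[-3:].lower(),
        }
    features = {f"0:{k}": v for k, v in base(i).items()}
    for offset in (-2, -1, 1, 2):              # two-word window to the left and right
        j = i + offset
        if 0 <= j < len(sent):
            features.update({f"{offset}:{k}": v for k, v in base(j).items()})
    return features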
organization lexicon, the probability of the organization
are processed with a morphological analyzer for
sense is higher than in the first case. The lexicon feature
determining the part of speech, gender, lemma,
containing the matched entry length helps the system to
grammatical number, case and characteristics of words.
distinguish these two cases.
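One way to realize this entry-length feature is a greedy longest match over lemmatized tokens, sketched below under the assumption that lexicon entries are stored as tuples of lemmas; the authors' matching procedure may differ.

def lexicon_length_features(lemmas, lexicon):
    """Assign to each token the length (in words) of the longest matched lexicon entry.

    `lemmas` is the list of token lemmas of a sentence; `lexicon` is a set of
    entries, each entry being a tuple of lemmas, e.g. ("apple",) or
    ("lomonosov", "moscow", "state", "university").
    """
    max_len = max((len(entry) for entry in lexicon), default=0)
    values = [0] * len(lemmas)
    for start in range(len(lemmas)):
        # Try the longest candidate entry first (greedy longest match).
        for length in range(min(max_len, len(lemmas) - start), 0, -1):
            if tuple(lemmas[start:start + length]) in lexicon:
                for i in range(start, start + length):
                    values[i] = max(values[i], length)
                break
    return values

With this scheme, a lone "apple" receives the value 1, while each token of "Lomonosov Moscow State University" receives 4, which reflects the distinction discussed above.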
The biggest lexicons are listed in Table 2. The overall size of all vocabularies is more than 335 thousand entries. These lexicons were collected from several sources: phonebooks, Russian Wikipedia, and the RuThes thesaurus [15].
Table 2 Vocabulary sizes

Vocabulary                Size, objects   Clarification
Famous persons            31482           Famous people
First names               2773            First names
Surnames                  66108           Surnames
Person roles              9935            Roles, posts
Verbs of informing        1729            Verbs that usually occur with persons
Companies                 33380           Organization names
Company types             6774            Organization types
Media                     3909            Media
Geography                 8969            Geographical objects
Geographical adjectives   1739            Geographical adjectives
Usual words               58432           Frequent Russian words (nouns, verbs, adjectives)
Equipment                 44094           Devices, equipment, tools

4.2 Adaptation of NER Model to Tweets

4.2.1 Unsupervised word clustering

In previous studies it was shown that unsupervised word clustering on the basis of a large text collection improves NER performance. In our case we compare the impact of word clusters calculated on a large news collection and on a large tweet collection. For clustering we use the Word2vec package (https://github.com/dav/word2vec). It represents words with distributional vectors (word embeddings) computed with a neural network. The semantic similarity between two words is calculated as the cosine measure between the two corresponding vectors. The package allows automatic clustering of words according to their calculated similarity. We used the CBOW model with the vector size equal to 300. Thus, each word has an additional feature: the number of the cluster in which it appears. The news collection utilized for clustering contains two million news documents. For tweet-based clustering we use a tweet collection consisting of randomly extracted Russian tweets and including 8.3 million tweets.
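A rough equivalent of this clustering step, using gensim and scikit-learn instead of the original word2vec tool, is sketched below; the CBOW setting and the vector size of 300 follow the text, while the number of clusters (500) and the other hyperparameters are assumptions.

from gensim.models import Word2Vec
from sklearn.cluster import KMeans

def build_word_clusters(tokenized_texts, n_clusters=500):
    """Train CBOW embeddings on a large unlabeled collection and assign each
    word a cluster id that can be used as an additional CRF feature."""
    model = Word2Vec(sentences=tokenized_texts, vector_size=300, sg=0,  # sg=0 selects CBOW
                     window=5, min_count=5, workers=4)
    words = model.wv.index_to_key
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(model.wv.vectors)
    return dict(zip(words, cluster_ids))  # word -> cluster number; unknown words get no feature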
4.2.2 Two-stage prediction

We suppose that for adapting a classifier to a text collection it can be useful to take into account the entities already labeled by the classifier and to memorize the named entity type statistics for future use.

On the first stage the classifier extracts named entities. Then the system collects the class statistics determined on the first stage for each word and uses them as features of the second stage. After that, the new features together with the old ones participate in the final classification. These statistics can be collected from the current text (the whole text or its part preceding the analyzed word) or from a large text collection (collection statistics). In the case of tweet processing, texts are small, therefore only the collection statistics can be used. In our experiments these statistics can be obtained from the FullTweetCollection gathered from the selected user accounts or from the labeled TestTweetCollection described in Section 3.2.

For each word, the system finds all mentions of this word in the processed collection and counts the frequencies of the named entity types determined for this word. Using these frequencies, for each entity type the system creates an additional feature, which has one of three values: no_one (if the word has not been recognized as a named entity of the chosen type), best (if the word has been assigned to the chosen named entity type in more than 50% of cases), and rare (if the word has been assigned to the chosen named entity type in less than 50% of cases).

For example, if the word "Russia" was met 500 times in a collection, and the classifier assigned it 200 times to organizations and 300 times to locations, then the values of the global statistics feature for the word "Russia" will be as follows: PER – no_one, ORG – rare, LOC – best.
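The collection-statistics feature can be sketched as follows; the value names no_one, best and rare follow the text, while the input format (word, first-stage label) is an assumption.

from collections import Counter, defaultdict

ENTITY_TYPES = ("PER", "ORG", "LOC")

def collection_statistics(first_stage_predictions):
    """first_stage_predictions: iterable of (word, predicted_type_or_None) pairs
    over the whole collection. Returns word -> {entity type: "no_one"/"rare"/"best"}."""
    mentions = Counter()              # total mentions of each word
    assigned = defaultdict(Counter)   # word -> counts of assigned entity types
    for word, label in first_stage_predictions:
        mentions[word] += 1
        if label in ENTITY_TYPES:
            assigned[word][label] += 1
    features = {}
    for word, total in mentions.items():
        features[word] = {}
        for etype in ENTITY_TYPES:
            count = assigned[word][etype]
            if count == 0:
                features[word][etype] = "no_one"
            elif count > 0.5 * total:
                features[word][etype] = "best"
            else:
                features[word][etype] = "rare"
    return features

For the "Russia" example above (500 mentions, 200 assigned to ORG and 300 to LOC), this function returns PER – no_one, ORG – rare, LOC – best.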
4.3 Normalization of Word Capitalization

As we found that in our tweet collection the share of misprints is not very high, we performed normalization only for word capitalization. The normalization was based on the large news collection described in Section 4.2.1. For each word in this collection, we counted how many times the word was written in lower case or capitalized when it does not stand at the beginning of a sentence. The more frequent variant was considered as normal for this word.

We considered the normalization in two variants:
• Variant A. All words in a tweet, except the first one, are changed to the normal form of capitalization;
• Variant B. All words in a tweet, including the first one, are changed to the normal form of capitalization.

We found that variant B produces better results and later experimented only with this variant.

Fig. 1 presents several tweets with manual annotation before normalization. Fig. 2 shows the same tweets after normalization.

Figure 1 Tweets before normalization
Figure 2 Tweets after normalization

Also the hashtag symbols were removed from a word if this word was found in the news collection, to improve its matching with the lexicons.
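A possible implementation of variant B together with the hashtag handling is sketched below; case_counts, assumed to map a lowercased word to its (lowercase, capitalized) occurrence counts in non-sentence-initial positions of the news collection, and news_vocabulary are hypothetical inputs.

def normalize_tweet(tokens, case_counts, news_vocabulary):
    """Normalize capitalization of every token (variant B) and strip '#' from
    hashtags whose word form occurs in the news collection."""
    normalized = []
    for token in tokens:
        if token.startswith("#") and token[1:].lower() in news_vocabulary:
            token = token[1:]  # remove the hashtag symbol to improve lexicon matching
        lower, capitalized = case_counts.get(token.lower(), (0, 0))
        if capitalized > lower:
            token = token.capitalize()  # the word is normally capitalized in news
        elif lower > capitalized:
            token = token.lower()       # the word is normally lowercase in news
        # if the word is unknown or the counts are tied, keep it as it is
        normalized.append(token)
    return normalized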
5 Experiments

In preprocessing we remove mentioned user accounts with "@" and URLs at the end of tweets. We consider these data as additional meta-information from which we should not extract names.
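This preprocessing step can be sketched with two regular expressions; the exact patterns are our assumption rather than the authors' code.

import re

MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")

def strip_meta_information(tweet: str) -> str:
    """Remove @user mentions and URLs so that no names are extracted from them."""
    tweet = MENTION_RE.sub(" ", tweet)
    tweet = URL_RE.sub(" ", tweet)
    return re.sub(r"\s+", " ", tweet).strip()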
We train the described variants of our NER model on the news collection. Table 3 shows the results of named entity recognition on the "Persons-1000" collection (cross-validation 3:4). It can be seen that our baseline model is quite good on the news collection and slightly improves after adding the clustering features and the two-stage approach. In this case the collection statistics are obtained from the same "Persons-1000" collection.

Table 3 News Collection NER Performance

Model                                               F-measure, %
Baseline                                            92.49
Baseline + News clusters                            93.48
Baseline + News clusters + Collection statistics    93.53

Then we apply the trained model to the test tweet collection with initial capitalization and with normalized capitalization. Table 4 presents the performance of the NER models trained on the "Persons-1000" collection on the tweet data. One can see that all models significantly degrade on the tweet collection.

Table 4 Tweet Performance

Model                                                       F-measure,            F-measure, Normalized
                                                            TestTweetCollection   TestTweetCollection
1) Baseline                                                 64.44%                69.88%
2) Baseline + Collection statistics (TestTweetCollection)   64.99%                70.32%
3) Baseline + Collection statistics (FullTweetCollection)   65.78%                70.44%
4) Baseline + news clusters                                 66.03%                70.88%
5) Baseline + tweet clusters                                66.08%                70.36%
6) Baseline + tweet and news clusters                       66.23%                70.89%
7) (2) + tweet and news clusters                            67.27%                71.20%
8) (3) + tweet and news clusters                            66.46%                69.73%

The normalization significantly improves the performance of NER (in contrast to other studies [7]). Word clustering and the collection statistics improve NER both for the initial and for the normalized text collections. Their impact is larger than for the news collection (Table 3). The combination of tweet and news clusters was better than tweet clusters alone, possibly because of the political and religious character of the gathered collection. In total, the NER performance improves by more than 10% on the tweet data.

Analyzing the mistakes of the best model on the normalized collection, we can still see a significant share of mistakes caused by incorrectly normalized capitalization. We can enumerate the following main subtypes of such problems:
• ambiguous words with different capitalization ("Earth", "Rose");
• words that should be capitalized in this specific collection. For example, "Paradise" and "Hell" seem to be specific entities in this genre of texts;
• multiword expressions in which each word is usually written in lower case, but together the multiword expression denotes a name and at least the first word should be capitalized. For example, the expression "Московский регион" (Moscow region) is normalized incorrectly because the word "московский" is written in lower case more frequently in the Russian news collection.

6 Conclusion

The paper describes an approach to creating a domain-specific tweet collection written by users frequently discussing Islam-related issues in Russian. We use this collection to study specific features of named entity recognition on Twitter. We found that, in contrast to tweets collected randomly, our tweet collection contains a relatively small number of spelling errors or strange word shortenings. Specific difficulties of our collection for named entity recognition include a large number of Arabic and other Eastern names (persons, locations, organizations) and frequent use of ALL-CAPS writing for emphasizing main words in messages.

We have studied the transfer of a NER model trained on a newswire collection to the created tweet collection and approaches to decrease the degradation of the model caused by the transfer. We found that for our specialized text collection, the largest improvement came from normalizing word capitalization. Two-stage approaches to named entity recognition and word2vec-based clustering were also useful for our task.

In future we plan to improve the techniques of tweet normalization and to study NER for tweets of followers of the selected users.
References

[1] Antonova, A.Y., Soloviev, A.N.: Conditional Random Field Models for the Processing of Russian. In: Int. Conf. "Dialog 2013", pp. 27-44. RGGU (2013)
[2] Bocharov, V.V. et al.: FactRuEval 2016: Evaluation of Named Entity Recognition and Fact Extraction Systems for Russian. In: Dialog Conference (2016)
[3] Brown, P.F., Della Pietra, V.J., Desouza, P.V., Lai, J.C., Mercer, R.L.: Class-based n-gram Models of Natural Language. Computational Linguistics, 18 (4), pp. 467-479 (1992)
[4] Cherry, C., Guo, H.: The Unreasonable Effectiveness of Word Representations for Twitter Named Entity Recognition. In: NAACL-2015, pp. 735-745 (2015)
[5] Chrupala, G.: Efficient Induction of Probabilistic Word Classes with LDA. In: 5th Int. Joint Conf. on Natural Language Processing, IJCNLP 2011, pp. 363-372. Asian Federation of Natural Language Processing (2011)
[6] Clark, A.: Combining Distributional and Morphological Information for Part of Speech Induction. In: 10th Conf. of the European Chapter of the Association for Computational Linguistics, EACL, 1, pp. 59-66. ACL (2003)
[7] Derczynski, L., Maynard, D., Rizzo, G., van Erp, M., Gorrell, G., Troncy, R., Bontcheva, K.: Analysis of Named Entity Recognition and Linking for Tweets. Information Processing & Management, 51 (2), pp. 32-49 (2015)
[8] Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., Dredze, M.: Annotating Named Entities in Twitter Data with Crowdsourcing. In: NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 80-88 (2010)
[9] Fromreide, H., Hovy, D., Sogaard, A.: Crowdsourcing and Annotating NER for Twitter #drift. In: LREC-2014, pp. 2544-2547 (2014)
[10] Gareev, R., Tkachenko, M., Solovyev, V., Simanovsky, A., Ivanov, V.: Introducing Baselines for Russian Named Entity Recognition. In: 14th Int. Conf. CICLing 2013, pp. 329-342. Springer (2013)
[11] Hidayatullah, A.F.: Language Tweet Characteristics of Indonesian Citizens. In: Int. Conf. IEEE-2015, pp. 397-401 (2015)
[12] Hovy, D.: Demographic Factors Improve Classification Performance. In: ACL-2015, pp. 752-762 (2015)
[13] Khoroshevsky, V.F.: Ontology Driven Multilingual Information Extraction and Intelligent Analytics. In: Web Intelligence and Security, pp. 237-262 (2010)
[14] Kuznetsov, I.P., Kozerenko, E.B., Kuznetsov, K.I., Timonina, N.O.: Intelligent System for Entities Extraction (ISEE) from Natural Language Texts. In: Int. Workshop on Conceptual Structures for Extracting Natural Language Semantics-Sense, (9), pp. 17-25 (2009)
[15] Nadeau, D., Sekine, S.: A Survey of Named Entity Recognition and Classification. Lingvisticae Investigationes, 30 (1), pp. 3-26 (2007)
[16] Paris, C., Thomas, P., Wan, S.: Differences in Language and Style Between Two Social Media Communities. In: 6th AAAI Int. Conf. on Weblogs and Social Media, ICWSM (2012)
[17] Podobryaev, A.V.: Persons Recognition Using CRF Model. In: 15th All-Russian Scientific Conf. "Digital Libraries: Advanced Methods and Technologies, Digital Collections", RCDL-2013, pp. 255-258. Demidov Yaroslavl State University (2013)
[18] Ratinov, L., Roth, D.: Design Challenges and Misconceptions in Named Entity Recognition. In: 13th Conf. on Computational Natural Language Learning, CoNLL, pp. 147-155. ACL (2009)
[19] Ritter, A., Clark, S., Mausam, Etzioni, O.: Named Entity Recognition in Tweets: An Experimental Study. In: EMNLP, pp. 1524-1534 (2011)
[20] Ritter, A., Etzioni, O., Clark, S. et al.: Open Domain Event Extraction from Twitter. In: Conf. on Knowledge Discovery and Data Mining, KDD, pp. 1104-1112 (2012)
[21] Cherry, C., Guo, H.: The Unreasonable Effectiveness of Word Representations for Twitter Named Entity Recognition. In: NAACL-2015, pp. 735-745 (2015)
[22] Trofimov, I.V.: Person Name Recognition in News Articles Based on the Persons-1000/1111-F Collections. In: 16th All-Russian Scientific Conf. "Digital Libraries: Advanced Methods and Technologies, Digital Collections", RCDL 2014, pp. 217-221 (2014)
[23] Yang, Y., Eisenstein, J.: Putting Things in Context: Community-specific Embedding Projections for Sentiment Analysis. arXiv preprint arXiv:1511.06052 (2015)