=Paper=
{{Paper
|id=Vol-2022/paper40
|storemode=property
|title=
Recognizing Names in Islam-Related Russian Twitter
|pdfUrl=https://ceur-ws.org/Vol-2022/paper40.pdf
|volume=Vol-2022
|authors=Valerie Mozharova,Natalia Loukachevitch
|dblpUrl=https://dblp.org/rec/conf/rcdl/MozharovaL17
}}
==
Recognizing Names in Islam-Related Russian Twitter
==
© V. Mozharova, © N. Loukachevitch
Lomonosov Moscow State University, Moscow, Russia
valerie.mozharova@gmail.com, louk_nat@mail.ru

Abstract. The paper describes an approach to creating a domain-specific tweet collection written by users frequently discussing Islam-related issues in Russian. We use this collection to study specific features of named entity recognition on Twitter. We found that, in contrast to tweets collected randomly, our tweet collection contains a relatively small number of spelling errors and strange word shortenings. Specific difficulties of our collection for named entity recognition include a large number of Arabic and other Eastern names and frequent use of ALL-CAPS spelling for emphasizing main words in messages. We studied the transfer of a NER model trained on a newswire collection to the created tweet collection and approaches to decrease the degradation of the model caused by the transfer. We found that for our specialized text collection, the largest improvement came from normalizing word capitalization. Two-stage approaches to named entity recognition and Word2vec-based clustering were also useful for our task.

Keywords: NER, CRF, Twitter.

Proceedings of the XIX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL’2017), Moscow, Russia, October 10–13, 2017

1 Introduction

Named entity recognition (NER) is one of the basic natural language processing tasks [17, 20]. Recognition of named entities in texts is used in many other information-processing tasks such as relation extraction, entity linking, and information retrieval. Most studies of NER have been carried out on news collections and have shown high quality of named entity extraction. However, the transfer of NER recognizers to other genres of texts has demonstrated a significant decrease in performance.

Currently, there is great interest in information extraction from texts published on social media platforms such as Twitter or Facebook because these platforms can serve as a very useful (fast and/or alternative) source of information [22]. But general NER recognizers designed for or trained on news collections can show a decrease in performance of up to 50% or more on these informal texts [4, 7–9].

Another important direction of social network studies concerns differences of language and style in specific social media communities [11, 18] or their dependence on social and demographic characteristics of users [12, 21].

In this paper, we consider the transfer of a Russian NER recognizer trained on news texts to extracting names from Twitter messages. Our tweet collection is specialized: it is gathered from messages of users who discuss issues related to Islam in their posts, in contrast to other studies where Twitter collections are formed by random sampling of Twitter messages. This allows us to reveal specific features of the tweet language of Islam-oriented and other similar communities. We consider the transfer of a CRF-based NER recognizer from news data to the tweet collection and approaches to decrease the degradation of the model caused by the transfer.

2 Related works

2.1 Named Entity Recognition for Twitter

It is known that extraction of names from Twitter messages is a much more difficult task than from other genres of text because of their shortness and informal character.

In [7] the authors review the problems and approaches to named entity recognition and entity linking for tweets. They write that tweet content is noisy because of incorrect spelling, irregular capitalization, and unusual abbreviations. In their experiments, the main sources of mistakes in named entity recognition in tweets were violations in capitalization, especially a large number of names written in lower case. They studied automatic normalization of tweets, including spelling and capitalization correction, and reported that in their investigation the normalization slightly improved NER performance for tweets.

In [24] the authors write that due to unreliable capitalization in tweets, common nouns are often misclassified as proper nouns, and vice versa. Some tweets contain all lowercase words (8%), whereas others are in ALL CAPS (0.6%). In addition to differences in vocabulary, the grammar of tweets differs from news text; for example, tweets often start with a verb. In their experiments, a supervised approach was used to predict the correct capitalization of words. The set of features included: the fraction of words in the tweet which are capitalized, the fraction which appear in a dictionary of frequently lowercase/capitalized words but are not
lowercase/capitalized in the tweet, the number of times the word ‘I’ appears lowercase, and whether or not the first word in the tweet is capitalized.

To study NER on Twitter performed with several NER systems, [8] use crowdsourcing to annotate tweets for the NER task. They annotate all @usernames as PER (person name). In annotating tweets for their experiments, [24] choose not to annotate @usernames mentioned in tweets as entities because it is trivial to identify them using a simple regular expression, and they would only serve to inflate the performance statistics.

In [4] the authors study the transfer of their NER model from news texts to tweets. They create a training set consisting of 1000 tweets. They use a baseline NER model based on token and context features (wordform, lemma, capitalization, prefixes and suffixes) and enhance it with two unsupervised representations (Brown clusters and vector representations) based on a large collection of unannotated tweets. In addition, they propose a technique to combine a relatively small Twitter training set and larger newswire training data. They report that the two unsupervised representations work better together than alone, and that the combination of training sets further improves the performance of their NER system.

2.2 Named Entity Recognition in Russian

In Russian there is a long tradition of engineering approaches to the named entity recognition task [13, 14, 23].

Machine-learning approaches for Russian NER usually employ the CRF machine learning method. In [1] the authors presented the results of a CRF-based method on various tasks, including named entity recognition. The experiments were carried out on their own Russian text corpus, which contained 71,000 sentences. They used only n-grams and orthographic features of tokens without utilizing any knowledge-based features. They achieved an F-score of 89.89% on three named entity types: names (93.15%), geographical objects (92.7%), and organizations (83.83%).

In [19] the experiments utilized the open Russian text collection “Persons-600” (http://ai-center.botik.ru/Airec/index.php?option=com_content&view=article&id=27:persons-600&catid=15&Itemid=40) for the person name recognition task. The CRF-based classifier employed such features as token features, context features, and features based on knowledge about persons (roles, professions, posts, and others). They achieved an F-score of 88.32% on person names.

In [10] the experiments were carried out on a Russian text collection containing 97 documents. The authors used two approaches to named entity recognition: a knowledge-based and a CRF-based approach. In the machine learning framework they utilized such features as token features and knowledge features based on word clustering (LDA topics [17], Brown clusters [3], Clark clusters [6]). They achieved an F-score of 75.05% on two named entity types: persons (84.84%) and organizations (71.31%).

In 2016 the FactRuEval competition for the Russian language was organized. The FactRuEval tasks included recognition of names in Russian news texts, recognition of specific attributes of names (family name, first name, etc.), and extraction of several types of facts [2].

So far, named entity recognition in tweets has not been studied for Russian. Also, the dependence of NER performance on the language of specific Twitter user communities has not been studied before.

3 Text collections

3.1 News Text Collection

We study the transfer of a CRF-based NER classifier trained on newswire data to the tweet collection. For training our system, we chose the open Russian text collection “Persons-1000”, which contains 1000 news documents labeled with three types of named entities: persons, organizations and locations (http://labinform.ru/pub/named_entities/descr_ne.htm). The labeling rules are detailed in [16]. The counts of each named entity type in the collection are listed in Table 1.

Table 1 The quantitative characteristics of the labeled named entities in text collections
 Type        News collection   Twitter collection
 PER                   10623                 1546
 ORG                    8541                 1144
 LOC                    7244                 2836
 OVERALL               26408                 5526

3.2 Tweet Text Collection

We are interested in studying the language of Islam-related Twitter users in Russian. To extract tweets from users discussing Islam-related issues, we created a list of 2700 Islam terms. Then we extracted Russian tweets mentioning these terms using the Twitter Search API, obtained the accounts of the users who posted the extracted tweets, and ordered the accounts by decreasing number of extracted tweets. We found that many words from our list are practically never mentioned in tweets, while other words (for example, “mosque” or “Muslim”) are often used by very different people, not only Muslims.

After studying tweets from the extracted accounts we created a very small list of the main Islam words (“Allah”, “Quran”, “Prophet”, in various forms of Russian morphology). We also added the names of several known Islamist organizations to find their possible non-Muslim proponents. Then we repeated the whole procedure of tweet and account extraction, and found that the extracted collections can be considered an appropriate approximation of messages generated by Islam-related users.
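The account-ranking step described above can be illustrated with a minimal Python sketch. It assumes the tweets are already available in memory as (user, text) pairs and makes no call to the Twitter API. The function name `rank_accounts` and the plain substring match (a rough stand-in for matching Russian word forms) are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def rank_accounts(tweets, terms, top_n=100):
    """Rank accounts by the number of their tweets that mention a seed term.

    tweets: iterable of (user, text) pairs (hypothetical input format).
    terms:  seed terms; matched as lower-cased substrings, which only
            approximates matching of inflected Russian word forms.
    Returns a list of (user, matched_tweet_count), most active first.
    """
    lowered_terms = [term.lower() for term in terms]
    counts = Counter()
    for user, text in tweets:
        text_lower = text.lower()
        if any(term in text_lower for term in lowered_terms):
            counts[user] += 1
    return counts.most_common(top_n)


sample = [
    ("user_a", "Аллах велик"),
    ("user_b", "Сегодня хорошая погода"),
    ("user_a", "Читаю Коран"),
    ("user_c", "Коран и наука"),
]
ranking = rank_accounts(sample, ["аллах", "коран"])
```

Accounts at the top of such a ranking would then be the candidates whose full timelines are downloaded.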
We selected 100 users with the largest number of extracted tweets, downloaded all their tweets and obtained a tweet collection consisting of 300 thousand tweets (hereafter FullTweetCollection). Then we randomly extracted tweets from different users, removed non-Russian or senseless tweets, and finally obtained a tweet collection of 4192 tweets (hereafter TestTweetCollection). The created collection contains messages with Quran quotes, religious and political argumentation, news-related messages mainly about Near and Middle East events (Syria, Iraq, Afghanistan, etc.) and Islamist organizations (Syrian opposition groups, ISIL, etc.), and also other types of messages (for example, advertisements).

The obtained collection was labeled similarly to “Persons-1000”. To annotate numerous mentions of Allah, we added the Deity type to the annotation scheme, but in the current study we consider the Deity type as a subtype of the Person type.

Analyzing the created collection from the point of view of NER difficulties, we found that violations in capitalization mainly consist of all-caps spelling of a whole tweet or of a fragment of it. Such capitalization is used for emphasizing important words in the text or words related to Allah, as in the following example: "За все потери ОН дает нам большую награду" (“For all the losses He gives us a great reward”). The tweets also mention a lot of Eastern names of persons, organizations (“Фастаким Кама умирт” (Fastakim Kama Umirt group), Джабхат Фатх аш-Шам (Jabhat Fateh al-Sham)), or local places that are difficult to recognize correctly.

The fraction of tweets with spelling mistakes or unusual shortenings is relatively low. We suppose that this is because the selected users are well-educated; they are professional writers in some sense, and in most cases they are opinion leaders whose messages are retweeted by many other people. Therefore it is especially useful to study the specific features of their tweet language.

4 Description of NER Model

In our study, we employ a baseline CRF classifier that utilizes token features, context features, and lexicon features for NER. Then we consider ways to improve the baseline model by adapting it to the Twitter language. The adaptation techniques include the use of two-stage processing and unsupervised word clustering. In addition, we test the impact of tweet normalization on NER performance.

4.1 Baseline model

Before named entity recognition with CRF, tweets are processed with a morphological analyzer to determine the part of speech, gender, lemma, grammatical number, case and other characteristics of words. This information is used to form the features of each word for classification. In the baseline model we consider three types of features: local token features, context features, and features based on lexicons.

4.1.1 Token features

The token features include:
    • Token initial form (lemma);
    • Number of symbols in a token;
    • Letter case. If a token begins with a capital letter and the other letters are small, the value of this feature is “Big Small”. If all letters are capital, the value is “Big Big”. If all letters are small, the value is “Small Small”. In other cases the value is “Fence”;
    • Token type. For lexemes the value of this feature is the part of speech; for punctuation marks it is the type of punctuation;
    • Symbol n-grams. The presence of prefixes, suffixes and other n-grams from predefined sets in a token.

4.1.2 Context-based features

The group of context features includes two feature types. The first type is local context features. It takes into account all the token feature values mentioned above for nearby words in a two-word window to the right and to the left of the current word.

The second type is the bigram context feature. It contains information about the named entity type determined for the previous word. It helps to find named entity borders more precisely. For example, if a person's surname is difficult to recognize, the presence of a first name before this word makes the classification easier.

4.1.3 Features based on lexicons

To improve the quality of recognition, we added special lexicons with lists of useful objects. An object can be a word or a phrase. The lexicons had been created before the current work and were not changed during the study.

To calculate the lexicon features, the system matches the text against lexicon entries. If a token occurs in a matched lexicon entry, it obtains a lexicon feature value equal to the length of the found entry. The use of the entry length as a feature helps to diminish the effect of lexical ambiguity. For example, the list of organizations contains “Apple” as the name of a company. But this word does not necessarily denote a company because it also has the sense of a fruit. In contrast, if we find in the text the phrase “Lomonosov Moscow State University”, which is also included in the organization lexicon, the probability of the organization sense is higher than in the first case. The lexicon feature containing the matched entry length helps the system to distinguish these two cases.

The biggest lexicons are listed in Table 2. The overall size of all vocabularies is more than 335 thousand entries. These lexicons were collected from several sources: phonebooks, Russian Wikipedia, and the RuThes thesaurus [15].
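Two of the baseline features described above, the letter-case feature and the lexicon feature based on entry length, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the entry length is measured in tokens (the text does not say whether words or characters are counted), and matching is done on lower-cased surface forms rather than on lemmas.

```python
def letter_case(token):
    """Return the letter-case feature value for a token."""
    if token[:1].isupper() and token[1:].islower():
        return "Big Small"      # capital first letter, the rest small
    if token.isupper():
        return "Big Big"        # all letters capital
    if token.islower():
        return "Small Small"    # all letters small
    return "Fence"              # any other mixture (or no letters at all)


def lexicon_features(tokens, lexicon):
    """For each token, length (in tokens) of the longest matched entry
    that covers it, or 0 when no lexicon entry covers the token."""
    entries = {tuple(entry.lower().split()) for entry in lexicon}
    max_len = max((len(entry) for entry in entries), default=0)
    lowered = [token.lower() for token in tokens]
    features = [0] * len(tokens)
    for start in range(len(tokens)):
        for n in range(1, max_len + 1):
            if start + n > len(tokens):
                break
            if tuple(lowered[start:start + n]) in entries:
                for i in range(start, start + n):
                    features[i] = max(features[i], n)
    return features


organizations = {"Apple", "Lomonosov Moscow State University"}
short_match = lexicon_features("She ate an Apple".split(), organizations)
long_match = lexicon_features(
    "She studies at Lomonosov Moscow State University".split(), organizations
)
```

In this sketch the ambiguous single-word match “Apple” receives the value 1, while every token of the four-word university name receives 4, which is the distinction the entry-length feature is meant to expose to the classifier.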
Table 2 Vocabulary sizes
 Vocabulary                Size, objects   Clarification
 Famous persons                    31482   Famous people
 First names                        2773   First names
 Surnames                          66108   Surnames
 Person roles                       9935   Roles, posts
 Verbs of informing                 1729   Verbs that usually occur with persons
 Companies                         33380   Organization names
 Company types                      6774   Organization types
 Media                              3909   Media
 Geography                          8969   Geographical objects
 Geographical adjectives            1739   Geographical adjectives
 Usual words                       58432   Frequent Russian words (nouns, verbs, adjectives)
 Equipment                         44094   Devices, equipment, tools

4.2 Adaptation of NER Model to Tweets

4.2.1 Unsupervised word clustering

In previous studies it was shown that unsupervised word clustering on the basis of a large text collection improves NER performance. In our case we compare the impact of word clusters calculated on a large news collection and on a large tweet collection. For clustering we use the Word2vec package (https://github.com/dav/word2vec). It represents words with distributional vectors (word embeddings) computed with a neural network. The semantic similarity between two words is calculated as the cosine measure between the two corresponding vectors. The package allows automatic clustering of words according to their calculated similarity. We used the CBOW model with a vector size of 300. Thus, each word has an additional feature: the number of the cluster in which it appears. The news collection utilized for clustering contains two million news documents. For tweet-based clustering we use a collection of 8.3 million randomly extracted Russian tweets.

4.2.2 Two-stage prediction

We suppose that for adapting a classifier to a text collection it can be useful to take into account the entities already labeled by the classifier and to memorize the named entity type statistics for future use.

In the first stage the classifier extracts named entities. Then the system collects the class statistics determined in the first stage for each word and uses them as features in the second stage. After that, the new features together with the old ones participate in the final classification. These statistics can be collected from the current text (the whole text or the part preceding the word under analysis) or from a large text collection (collection statistics). In the case of tweet processing, texts are short, so only the collection statistics can be used. In our experiments these statistics can be obtained from the FullTweetCollection gathered from the selected user accounts or from the labeled TestTweetCollection, as described in Section 3.2.

For each word, the system finds all mentions of this word in the processed collection and counts the frequencies of the named entity types determined for this word. Using these frequencies for each entity type, the system creates additional features, which have one of three values: no_one (if the word has not been recognized as a named entity of the chosen type), best (if the word has been assigned to the chosen named entity type in more than 50% of cases), and rare (if the word has been assigned to the chosen named entity type in less than 50% of cases).

For example, if the word “Russia” occurred 500 times in a collection, and the classifier assigned it 200 times to organizations and 300 times to locations, then the values of the global statistics feature for the word “Russia” will be as follows: PER: no_one, ORG: rare, LOC: best.

4.3 Normalization of Word Capitalization

As we found that the share of misprints in our tweet collection is not very high, we performed normalization only for word capitalization. The normalization was based on the large news collection described in Section 4.2.1. For each word in this collection, we counted how many times the word was written in lower case or capitalized when it does not stand at the beginning of a sentence. The more frequent case was considered normal for this word.

We considered the normalization in two variants:
    • Variant A. All words in a tweet, except the first one, are changed to the normal form of capitalization;
    • Variant B. All words in a tweet, including the first one, are changed to the normal form of capitalization.

Figure 1 Tweets before normalization

We found that variant B produces better results and subsequently experimented only with this variant.

Fig. 1 presents several tweets with manual annotation before normalization. Fig. 2 shows the same tweets after normalization.
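The collection statistics feature of the two-stage approach and the capitalization normalization described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the share of each entity type is computed over all mentions of the word, a share of exactly 50% is treated as rare, “capitalized” is taken to mean a capital first letter, and words unseen in the news collection are left unchanged. All of these choices are assumptions that the text does not settle.

```python
from collections import Counter, defaultdict

ENTITY_TYPES = ("PER", "ORG", "LOC")


def collection_stat_features(mention_count, type_counts):
    """Map first-stage label counts of one word to no_one / best / rare."""
    features = {}
    for entity_type in ENTITY_TYPES:
        assigned = type_counts.get(entity_type, 0)
        if assigned == 0:
            features[entity_type] = "no_one"
        elif assigned / mention_count > 0.5:
            features[entity_type] = "best"
        else:
            features[entity_type] = "rare"  # assumption: exactly 50% is rare
    return features


def build_case_stats(sentences):
    """Count lower-case and capitalized spellings of each word,
    skipping the first word of every sentence."""
    stats = defaultdict(Counter)
    for sentence in sentences:
        for word in sentence[1:]:
            if word.islower():
                stats[word.lower()]["lower"] += 1
            elif word[:1].isupper():
                stats[word.lower()]["capitalized"] += 1
    return stats


def normalize_tweet(tokens, stats, include_first=True):
    """Variant B when include_first is True, Variant A otherwise."""
    normalized = []
    for index, token in enumerate(tokens):
        counts = stats.get(token.lower())
        if counts is None or (index == 0 and not include_first):
            normalized.append(token)  # unseen word or protected first word
        elif counts["capitalized"] > counts["lower"]:
            normalized.append(token.capitalize())
        else:
            normalized.append(token.lower())
    return normalized


russia = collection_stat_features(500, {"ORG": 200, "LOC": 300})

news = [
    ["The", "president", "visited", "Moscow"],
    ["Later", "he", "left", "Moscow", "for", "the", "region"],
]
case_stats = build_case_stats(news)
variant_b = normalize_tweet(["HE", "LEFT", "MOSCOW", "today"], case_stats)
variant_a = normalize_tweet(
    ["HE", "LEFT", "MOSCOW", "today"], case_stats, include_first=False
)
```

For the “Russia” example from the text, the first function reproduces the stated values. One limitation of this simple normalizer is that it cannot restore all-caps abbreviations, since it only chooses between a lower-case and a first-letter-capitalized spelling.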
Figure 2 Tweets after normalization

Also, the hashtag symbol was removed from a word if the word was found in the news collection, to improve its matching with the lexicons.

5 Experiments

In preprocessing we remove mentioned user accounts marked with “@” and URLs at the end of tweets. We consider these data as additional meta-information from which we should not extract names.

We train the described variants of our NER model on the news collection. Table 3 shows the results of named entity recognition on the “Persons-1000” collection (cross-validation 3:4). It can be seen that our baseline model is quite good on the news collection and improves slightly after adding clustering features and the two-stage approach. In this case the collection statistics are obtained from the same “Persons-1000” collection.

Table 3 News Collection NER Performance
 Model                                               F-measure, %
 Baseline                                            92.49
 Baseline + News clusters                            93.48
 Baseline + News clusters + Collection statistics    93.53

Then we apply the trained model to the test tweet collection with initial capitalization and with normalized capitalization. Table 4 presents the performance of NER models trained on the “Persons-1000” collection on the tweet data. One can see that all models degrade significantly on the tweet collection.

The normalization significantly improves the performance of NER (in contrast to other studies [7]). Word clustering and the collection statistics improve NER for both the initial and the normalized text collections. Their impact is larger than for the news collection (Table 3). The combination of tweet and news clusters was better than tweet clusters alone, possibly because of the political and religious character of the gathered collection. In total, the NER performance on tweet data improves by more than 10% in relative terms (from 64.44% to 71.20% F-measure).

Analyzing the mistakes of the best model on the normalized collection, we still see a significant share of mistakes caused by incorrectly normalized capitalization. We can enumerate the following main subtypes of such problems:
    • words that should be capitalized in this specific collection. For example, “Paradise” and “Hell” seem to be specific entities in this genre of texts;
    • multiword expressions in which each word is usually written in lower case, but together the multiword expression denotes a name and at least the first word should be capitalized. For example, the expression "Московский регион" (Moscow region) is normalized incorrectly because the word "московский" is written in lower case more frequently in the Russian news collection.

Table 4 Tweet Performance
 Model                                                        F-measure,             F-measure,
                                                              TestTweetCollection    Normalized TestTweetCollection
 1) Baseline                                                  64.44%                 69.88%
 2) Baseline + Collection statistics (TestTweetCollection)    64.99%                 70.32%
 3) Baseline + Collection statistics (FullTweetCollection)    65.78%                 70.44%
 4) Baseline + news clusters                                  66.03%                 70.88%
 5) Baseline + tweet clusters                                 66.08%                 70.36%
 6) Baseline + tweet and news clusters                        66.23%                 70.89%
 7) (2) + tweet and news clusters                             67.27%                 71.20%
 8) (3) + tweet and news clusters                             66.46%                 69.73%

6 Conclusion

The paper describes an approach to creating a domain-specific tweet collection written by users frequently discussing Islam-related issues in Russian. We use this collection to study specific features of named entity recognition on Twitter. We found that, in contrast to tweets collected randomly, our tweet collection contains a relatively small number of spelling errors and strange word shortenings. Specific difficulties of our collection for named entity recognition include a large number of Arabic and other Eastern names (persons, locations, organizations) and frequent use of ALL-CAPS writing for emphasizing main words in messages.

We have studied the transfer of a NER model trained on a newswire collection to the created tweet collection and approaches to decrease the degradation of the model caused by the transfer. We found that for our specialized text collection, the largest improvement came from normalizing word capitalization. Two-stage approaches to named entity recognition and word2vec-based clustering were also useful for our task.

In future we plan to improve techniques of tweet normalization and study NER for tweets of followers of
• ambiguous words with different capitalization (“Earth”,           the selected users.
  “Rose”),
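    The role of capitalization normalization, and the ambiguity problem noted in the error
analysis (“Earth”, “Rose”), can be illustrated with a minimal sketch. It is not the
normalization procedure used in the paper: it assumes a simple frequency heuristic in which an
ALL-CAPS token is replaced by its most frequent casing in a reference collection. The function
names and the toy reference text are hypothetical.

```python
from collections import Counter, defaultdict


def build_case_stats(reference_tokens):
    """Count the surface forms seen for each lowercased word (assumed heuristic)."""
    stats = defaultdict(Counter)
    for tok in reference_tokens:
        stats[tok.lower()][tok] += 1
    return stats


def normalize_token(tok, stats):
    """Replace an ALL-CAPS token by its most frequent casing in the reference data.

    Tokens that are not ALL-CAPS, single-letter tokens, and words unseen in the
    reference data are returned unchanged.
    """
    if len(tok) < 2 or not tok.isupper():
        return tok
    variants = stats.get(tok.lower())
    if not variants:
        return tok
    return variants.most_common(1)[0][0]


def normalize_tweet(tokens, stats):
    """Apply token-level normalization to a tokenized tweet."""
    return [normalize_token(t, stats) for t in tokens]


# Toy reference text (hypothetical); a real setup would use a large collection.
reference = "Rose met Ahmed near the river . The rose was red . A rose is a rose".split()
stats = build_case_stats(reference)
normalized = normalize_tweet(["ROSE", "MET", "AHMED"], stats)
```

    In this toy example “AHMED” is restored to “Ahmed”, while “ROSE” becomes “rose” because the
common noun is more frequent than the name in the reference text. This is the kind of ambiguous
word for which a context-free heuristic is likely to produce an incorrect normalization.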
References

 [1] Antonova, A.Y., Soloviev, A.N.: Conditional Random Field Models for the Processing of
     Russian. In: Int. Conf. “Dialog 2013”, pp. 27-44. RGGU (2013)
 [2] Bocharov, V.V. et al.: “FactRuEval 2016: Evaluation of Named Entity Recognition and Fact
     Extraction Systems for Russian”. In: Dialog Conference. (2016)
 [3] Brown, P.F., Della Pietra, V.J., Desouza, P.V., Lai, J.C., Mercer, R.L.: Class-based
     n-gram Models of Natural Language. Computational Linguistics, 18 (4), pp. 467-479 (1992)
 [4] Cherry, C., Guo, H.: The Unreasonable Effectiveness of Word Representations for Twitter
     Named Entity Recognition. In: NAACL-2015. pp. 735-745 (2015)
 [5] Chrupala, G.: Efficient Induction of Probabilistic Word Classes with LDA. In: 5th Int.
     Joint Conf. on Natural Language Processing, IJCNLP 2011, pp. 363-372. Asian Federation of
     Natural Language Processing (2011)
 [6] Clark, A.: Combining Distributional and Morphological Information for part of Speech
     Induction. In: 10th Conf. on European Chapter of the Association for Computational
     Linguistics, EACL, 1, pp. 59-66. ACL (2003)
 [7] Derczynski, L., Maynard, D., Rizzo, G., van Erp, M., Gorrell, G., Troncy, R.,
     Bontcheva, K.: Analysis of Named Entity Recognition and Linking for Tweets. Information
     Processing & Management, 51 (2), pp. 32-49 (2015)
 [8] Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., Dredze, M.: Annotating
     Named Entities in Twitter Data with Crowdsourcing. In: the NAACL HLT 2010 Workshop on
     Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 80-88 (2010)
 [9] Fromreide, H., Hovy, D., Sogaard, A.: Crowdsourcing and Annotating NER for Twitter
     #drift. In: LREC-2014, pp. 2544-2547 (2014)
[10] Gareev, R., Tkachenko, M., Solovyev, V., Simanovsky, A., Ivanov, V.: Introducing
     Baselines for Russian Named Entity Recognition. In: 14th Int. Conf. CICLing 2013,
     pp. 329-342. Springer (2013)
[11] Hidayatullah, A.F.: Language Tweet Characteristics of Indonesian Citizens. In: Int. Conf.
     IEEE-2015. pp. 397-401 (2015)
[12] Hovy, D.: Demographic Factors Improve Classification Performance. In: ACL-2015,
     pp. 752-762 (2015)
[13] Khoroshevsky, V.F.: Ontology Driven Multilingual Information Extraction and Intelligent
     Analytics. Web Intelligence and Security. pp. 237-262 (2010)
[14] Kuznetsov, I.P., Kozerenko, E.B., Kuznetsov, K.I., Timonina, N.O.: Intelligent System for
     Entities Extraction (ISEE) from Natural Language Texts. In: Int. Workshop on Conceptual
     Structures for Extracting Natural Language Semantics-Sense, (9), pp. 17-25 (2009)
[15] Nadeau, D., Sekine, S.: A Survey of Named Entity Recognition and Classification.
     Lingvisticae Investigationes, 30 (1), pp. 3-26 (2007)
[16] Paris, C., Thomas, P., Wan, S.: Differences in Language and Style Between Two Social
     Media Communities. In: the 6th AAAI Int. Conf. on Weblogs and Social Media, ICWSM (2012)
[17] Podobryaev, A.V.: Persons Recognition Using CRF Model. In: 15th All-Russian Scientific
     Conf. “Digital Libraries: Advanced Methods and Technologies, Digital Collection”,
     RCDL-2013, pp. 255-258. Demidov Yaroslavl State University (2013)
[18] Ratinov, L., Roth, D.: Design Challenges and Misconceptions in Named Entity Recognition.
     In: 13th Conf. on Computational Natural Language Learning, CoNLL, pp. 147-155. ACL (2009)
[19] Ritter, A., Clark, S., Mausam, Etzioni, O.: Named Entity Recognition in Tweets: An
     Experimental Study. In: EMNLP, pp. 1524-1534 (2011)
[20] Ritter, A., Etzioni, O., Clark, S. et al.: Open Domain Event Extraction from Twitter. In:
     Conf. on Knowledge Discovery and Data Mining, KDD, pp. 1104-1112 (2012)
[21] Cherry, C., Guo, H.: The Unreasonable Effectiveness of Word Representations for Twitter
     Named Entity Recognition. In: NAACL-2015. pp. 735-745 (2015)
[22] Trofimov, I.V.: Person Name Recognition in News Articles Based on the
     Persons-1000/1111-F Collections. In: 16th All-Russian Scientific Conf. “Digital
     Libraries: Advanced Methods and Technologies, Digital Collections”, RCDL 2014,
     pp. 217-221 (2014)
[23] Yang, Y., Eisenstein, J.: Putting Things in Context: Community-specific Embedding
     Projections for Sentiment Analysis. arXiv preprint arXiv:1511.06052 (2015)