<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recognizing Names in Islam-Related Russian Twitter</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>V. Mozharova</string-name>
          <email>valerie.mozharova@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lomonosov Moscow State University</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Proceedings of the XIX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL'2017)</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>246</fpage>
      <lpage>251</lpage>
      <abstract>
<p>The paper describes an approach to creating a domain-specific tweet collection written by users frequently discussing Islam-related issues in Russian. We use this collection to study specific features of named entity recognition (NER) on Twitter. We found that, in contrast to randomly collected tweets, our collection contains a relatively small number of spelling errors or unusual word shortenings. Specific difficulties of our collection for named entity recognition include a large number of Arabic and other Eastern names and the frequent use of ALL-CAPS spelling for emphasizing the main words in messages. We studied the transfer of a NER model trained on a newswire collection to the created tweet collection, along with approaches to decrease the degradation of the model caused by the transfer. We found that for our specialized text collection the largest improvement came from normalizing word capitalization. Two-stage approaches to named entity recognition and Word2vec-based clustering were also useful for our task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Named entity recognition (NER) is one of the basic
natural language processing tasks [
        <xref ref-type="bibr" rid="ref17 ref20">17, 20</xref>
]. Recognition
of named entities in texts is used in many other
information-processing tasks, such as relation extraction,
entity linking, and information retrieval. Most studies of
NER have been carried out on news collections and
have shown high quality of named entity extraction. However,
transferring NER recognizers to other genres of text
leads to a significant decrease in performance.
      </p>
      <p>
        Currently, there is a great interest in information
extraction from texts published on social media
platforms such as Twitter or Facebook because these
platforms can serve as a very useful (fast and/or
alternative) source of information [
        <xref ref-type="bibr" rid="ref22">22</xref>
]. However, the application
of general NER recognizers designed for or trained on
news collections can lead to a decrease in
performance of 50% or more on these informal
texts [
        <xref ref-type="bibr" rid="ref4 ref7 ref8 ref9">4, 7–9</xref>
        ].
      </p>
      <p>
Another important direction of social network research
addresses differences in language and style across specific
social media communities [
        <xref ref-type="bibr" rid="ref11 ref18">11, 18</xref>
        ] or their dependence
on social and demographic characteristics of users [
        <xref ref-type="bibr" rid="ref12 ref21">12,
21</xref>
        ].
      </p>
<p>In this paper, we consider the transfer of a Russian
NER recognizer trained on news texts to extracting
names from Twitter messages. Our tweet collection is
specialized: it is gathered from messages of users
who discuss Islam-related issues in their posts, in
contrast to other studies, where Twitter collections are
formed by random sampling of Twitter messages. This
allows us to reveal specific features of the tweet language
of Islam-oriented and similar communities. We
consider the transfer of a CRF-based NER recognizer from
news data to the tweet collection and approaches to
decrease the degradation of the model caused by the
transfer.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related works</title>
      <sec id="sec-2-1">
        <title>2.1 Named Entity Recognition for Twitter</title>
<p>It is known that extracting names from Twitter
messages is a much more difficult task than from other
genres of text because of their shortness and informal
character.</p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] the authors review the problems and
approaches to named entity recognition and entity
linking for tweets. They write that tweet content is
noisy because of incorrect spelling, irregular
capitalization, and unusual abbreviations. In their
experiments, the main sources of mistakes in named
entity recognition in tweets were violations of
capitalization, especially a large number of names written
in lower case. They studied automatic normalization of
tweets, including spelling and capitalization correction,
and reported that in their investigation the normalization
slightly improved NER performance for tweets.
        </p>
        <p>In [24] the authors write that due to unreliable
capitalization in tweets, common nouns are often
misclassified as proper nouns, and vice versa. Some
tweets contain all lowercase words (8%), whereas others
are in ALL CAPS (0.6%). In addition to differences in
vocabulary, the grammar of tweets differs from news
text; for example, tweets often start with a verb. In their
experiments, a supervised approach was used to predict
the correct capitalization of words. The set of features
included: the fraction of words in the tweet that are
capitalized; the fraction that appear in a dictionary of
frequently lowercase/capitalized words but are not
lowercase/capitalized in the tweet; the number of times
the word ‘I’ appears lowercase; and whether or not the
first word in the tweet is capitalized.</p>
        <p>
          To study the performance of several
NER systems on Twitter, [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] use crowdsourcing to annotate tweets
for the NER task. They annotate all @user-names as PER
(person name). In contrast, when annotating tweets for their
experiments, [24] chose not to annotate @usernames mentioned in
tweets as entities because it is trivial to identify them
using a simple regular expression, and they would only
serve to inflate the performance statistics.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] the authors study the transfer of their NER
model from news texts to tweets. They create a training
set consisting of 1000 tweets. They use a baseline NER
model based on token and context features (wordform,
lemma, capitalization, prefixes and suffixes) and
enhance it with two unsupervised representations
(Brown clusters and vector representations) based on a
large collection of unannotated tweets. They also
propose a technique to combine a relatively small Twitter
training set with larger newswire training data. They
report that two unsupervised representations work
together better than alone, and the combination of
training sets further improves the performance of their
NER system.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Named Entity Recognition in Russian</title>
        <p>
          For Russian, there is a long tradition of engineering
approaches to the named entity recognition task [
          <xref ref-type="bibr" rid="ref13 ref14 ref23">13, 14,
23</xref>
          ].
        </p>
        <p>
          Machine-learning approaches for Russian NER
usually employ the CRF machine learning method. In [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]
the authors presented the results of the CRF-based
method on various tasks, including the named entity
recognition. The experiments were carried out on their
own Russian text corpus, which contained 71,000
sentences. They used only n-grams and orthographic
features of tokens without utilizing any knowledge-based
features. They achieved 89.89% of F-score on three
named entity types: names (93.15%), geographical
objects (92.7%), and organizations (83.83%).
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] the experiments utilized the open Russian text
collection “Persons-600” (http://ai-center.botik.ru/Airec/index.php?option=com_content&amp;view=article&amp;id=27:persons600&amp;catid=15&amp;Itemid=40) for the person name
recognition task. The CRF-based classifier employed
such features as token features, context features, and the
features based on knowledge about persons (roles,
professions, posts, and other). They achieved 88.32% of
F-score on person names.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] the experiments were carried out on the
Russian text collection, which contained 97 documents.
The authors used two approaches for the named entity
recognition: knowledge-based and CRF-based approach.
In the machine learning framework they utilized such
features as the token features and the knowledge features
based on word clustering (LDA topics [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], Brown clusters [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], Clark clusters [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]). They achieved 75.05%
of F-score on two named entity types: persons (84.84%)
and organizations (71.31%).
        </p>
        <p>
          In 2016, the FactRuEval competition for the Russian
language was organized. The FactRuEval tasks included
recognition of names in Russian news texts, recognition
of specific name attributes (family name, first name,
etc.), and extraction of several types of facts [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>So far, named entity recognition in tweets has not
been studied for Russian. Also, the dependence of NER
performance on the language of specific Twitter user
communities has not been studied before.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Text collections</title>
      <sec id="sec-3-1">
        <title>3.1 News Text Collection</title>
        <p>
          We study the transfer of a CRF-based NER classifier
trained on newswire data to the tweet collection. For
training our system, we chose the open Russian text
collection "Persons-1000", which contains 1000 news
documents labeled with three types of named entities:
persons, organizations, and locations
(http://labinform.ru/pub/named_entities/descr_ne.htm). The labeling rules
are detailed in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The counts of each named entity type
in the collection are listed in Table 1.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Tweet Collection</title>
        <p>We are interested in studying the language of
Islam-related Twitter users in Russian. To extract tweets from
users discussing Islam-related issues, we created a list of
2700 Islam-related terms. Then we extracted Russian tweets
mentioning these terms using the Twitter Search API, got
the user accounts containing the extracted tweets, and ordered
the accounts by decreasing number of extracted tweets.
We found that many words from
our list are practically never mentioned in tweets, while other
words (for example, “mosque” or “Muslim”) are often
used by very different people, not only Muslims.</p>
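        <p>The account-ranking step above can be sketched as follows (a minimal Python illustration over hypothetical (account, tweet) pairs already matched against the term list; the actual study used the Twitter Search API):</p>
        <preformat>
```python
from collections import Counter

def rank_accounts(matched_tweets):
    """Order accounts by decreasing number of tweets that matched
    the term list.

    matched_tweets: iterable of (account, tweet_text) pairs that
    already matched at least one term from the word list.
    """
    counts = Counter(account for account, _ in matched_tweets)
    # most_common() returns accounts sorted by decreasing tweet count
    return [account for account, _ in counts.most_common()]

# Hypothetical example data
pairs = [("@userA", "t1"), ("@userB", "t2"), ("@userA", "t3")]
print(rank_accounts(pairs))  # ['@userA', '@userB']
```
        </preformat>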
        <p>After studying tweets from the extracted accounts, we
created a very small list of the main Islam-related words
(“Allah”, “Quran”, “Prophet”, in various Russian
morphological forms). We also added the names of
several known Islamist organizations to find their
possible non-Muslim proponents. Then we repeated the
whole procedure of tweet and account extraction and
found that the extracted collection can be considered
an appropriate approximation of messages generated by
Islam-related users.</p>
        <p>We selected 100 users with the largest number of
extracted tweets, downloaded all their tweets, and
obtained a tweet collection consisting of 300 thousand
tweets (hereafter FullTweetCollection). Then we randomly
extracted tweets from different users, removed
non-Russian or senseless tweets, and finally obtained a tweet
collection of 4192 tweets (hereafter TestTweetCollection).
The created collection contains messages with Quran
quotes, religious and political argumentation,
news-related messages mainly about Near and Middle East
events (Syria, Iraq, Afghanistan, etc.) and Islamist
organizations (Syrian opposition groups, ISIL, etc.), and
also other types of messages (for example,
advertisements).</p>
        <p>The obtained collection was labeled similarly to
“Persons-1000”. To annotate the numerous mentions of
Allah, we added the Deity type to the annotation scheme,
but in the current study we consider the Deity type a
subtype of the Person type.</p>
        <p>Analyzing the created collection from the point of
view of NER difficulties, we found that violations of
capitalization mainly involve all-caps spelling of a
whole tweet or its fragment. Such capitalization is used
for emphasizing important words in the text or words
related to Allah, as in the following example: “За все
потери ОН дает нам большую награду” (“For all the
losses He gives us a great reward”). The tweets also
mention many Eastern names of persons, organizations
(“Фастаким Кама умирт” (Fastakim Kama Umirt
group), “Джабхат Фатх аш-Шам” (Jabhat Fateh al-Sham)),
and local places that are difficult to recognize correctly.</p>
        <p>The fraction of tweets with spelling mistakes or
unusual shortenings is relatively low. We suppose that
this is because the selected users are well-educated; they
are professional writers in some sense and, in most cases,
opinion leaders whose messages are retweeted by
many other people. Therefore, it is especially useful to
study the specific features of their tweet language.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Description of NER Model</title>
      <p>In our study, we employ a baseline CRF classifier
that utilizes token features, context features, and lexicon
features for NER. We then consider ways to improve
the baseline model by adapting it to the Twitter language.
The adaptation techniques include two-stage
processing and unsupervised word clustering. In addition,
we test the impact of tweet normalization on NER
performance.</p>
      <sec id="sec-4-1">
        <title>4.1 Baseline model</title>
        <p>Before named entity recognition with CRF, tweets
are processed with a morphological analyzer for
determining the part of speech, gender, lemma,
grammatical number, case, and other characteristics of words.
This information is used to form features of each word
for classification. In the baseline model we consider three
types of features: local token features, context features,
and features based on lexicons.
The token features include:</p>
        <p>
• Token initial form (lemma);
• Number of symbols in a token;
• Letter case. If a token begins with a capital letter and the other letters are lowercase, the value of this feature is “Big Small”. If all letters are capital, the value is “Big Big”. If all letters are lowercase, the value is “Small Small”. In other cases the value is “Fence”;
• Token type. For lexemes, the value is the part of speech; for punctuation marks, it is the type of punctuation;
• Symbol n-grams. The presence in a token of prefixes, suffixes, and other n-grams from predefined sets.</p>
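        <p>The letter-case feature can be sketched as follows (a minimal illustration; the feature values follow the description above, while the handling of tokens without letters is our assumption):</p>
        <preformat>
```python
def letter_case_feature(token: str) -> str:
    """Letter-case feature: 'Big Small', 'Big Big', 'Small Small',
    or 'Fence' for mixed patterns. Tokens with no letters are mapped
    to 'Fence' (our assumption)."""
    letters = [c for c in token if c.isalpha()]
    if not letters:
        return "Fence"
    if all(c.isupper() for c in letters):
        return "Big Big"
    if all(c.islower() for c in letters):
        return "Small Small"
    if letters[0].isupper() and all(c.islower() for c in letters[1:]):
        return "Big Small"
    return "Fence"

print(letter_case_feature("Moscow"))  # Big Small
print(letter_case_feature("ISIL"))    # Big Big
print(letter_case_feature("mosque"))  # Small Small
print(letter_case_feature("iPhone"))  # Fence
```
        </preformat>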
      </sec>
      <sec id="sec-4-2">
        <title>4.1.2 Context-based features</title>
        <p>The group of context features includes two feature
types. The first type is local context features: it takes into
account all the above token feature values of nearby
words in a two-word window to the right and to the left
of the current word.</p>
        <p>The second type is the bigram context feature. It
contains information about the named entity type
determined for the previous word. It helps to find named entity
borders more precisely. For example, if a person's
surname is difficult to recognize, the presence of
the first name before it makes the classification
easier.</p>
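        <p>The local-context expansion can be illustrated with a minimal sketch (lemma-only features and illustrative feature names; the real model copies all token features of the neighbouring words):</p>
        <preformat>
```python
def context_features(lemmas, i):
    """Add the lemma of each neighbour in a two-word window on both
    sides, keyed by its offset from the current word (feature names
    are illustrative)."""
    feats = {}
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if j in range(len(lemmas)):  # skip positions outside the sentence
            feats["lemma[%d]" % offset] = lemmas[j]
    return feats

print(context_features(["he", "visited", "Moscow", "yesterday"], 2))
# {'lemma[-2]': 'he', 'lemma[-1]': 'visited', 'lemma[1]': 'yesterday'}
```
        </preformat>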
      </sec>
      <sec id="sec-4-3">
        <title>4.1.3 Features based on lexicons</title>
        <p>To improve the quality of recognition, we added
special lexicons with lists of useful objects. An object
can be a word or a phrase. The lexicons had been created
before the current work and were not changed during the
study.</p>
        <p>To calculate the lexicon features, the system matches
the text against lexicon entries. If a token occurs in a matched
lexicon entry, it obtains a lexicon feature value
equal to the length of the found entry. Using the
entry length as a feature helps to diminish the effect of
lexical ambiguity. For example, the list of
organizations contains “Apple” as the name of a company.
But this word does not necessarily denote a company
because it also has the sense of a fruit. In contrast,
if we find in the text the phrase “Lomonosov Moscow
State University”, which is also included in the
organization lexicon, the probability of the organization
sense is higher than in the first case. The lexicon feature
containing the matched entry length helps the system to
distinguish these two cases.</p>
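        <p>A minimal sketch of the match-length lexicon feature (lowercased token matching and the five-token cap on entry length are our simplifications):</p>
        <preformat>
```python
def lexicon_feature(tokens, lexicon):
    """For each token covered by a lexicon entry, the feature value is
    the length (in tokens) of the longest matching entry; 0 otherwise."""
    values = [0] * len(tokens)
    for start in range(len(tokens)):
        # try longer matches first (illustrative cap of 5-token entries)
        for length in (5, 4, 3, 2, 1):
            span = tokens[start:start + length]
            if len(span) == length and " ".join(span) in lexicon:
                for k in range(start, start + length):
                    values[k] = max(values[k], length)
                break
    return values

orgs = {"apple", "lomonosov moscow state university"}
tokens = "apple fell near lomonosov moscow state university".split()
print(lexicon_feature(tokens, orgs))  # [1, 0, 0, 4, 4, 4, 4]
```
        </preformat>
The ambiguous one-token entry “apple” gets feature value 1, while tokens inside the four-token university name get value 4, letting the classifier weigh the longer match more heavily.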
        <p>
          The biggest lexicons are listed in Table 2. The overall
size of all vocabularies is more than 335 thousand
entries. These lexicons were collected from several
sources: phonebooks, Russian Wikipedia, and the RuThes
thesaurus [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <sec id="sec-4-3-1">
          <title>4.2 Adaptation of NER Model to Tweets</title>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.2.1 Unsupervised word clustering</title>
        <p>Previous studies have shown that unsupervised
word clustering on the basis of a large text collection
improves NER performance. In our case, we compare
the impact of word clusters computed on a large news
collection and on a large tweet collection. For clustering we
use the Word2vec package. It represents words with
distributional vectors (word embeddings) computed
with a neural network. The semantic similarity
between two words is calculated as the cosine measure
between the two corresponding vectors. The package supports
automatic clustering of words according to their
calculated similarity. We used the CBOW model with
a vector size of 300. Thus, each word has an
additional feature: the number of the cluster in which it
appears. The news collection utilized for clustering
contains two million news documents. For tweet-based
clustering we use a collection of 8.3 million
randomly extracted Russian tweets.</p>
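        <p>The cluster-feature idea can be illustrated with a toy sketch of cosine-based cluster assignment (pure Python with two-dimensional toy vectors; the actual study used the Word2vec CBOW model with 300-dimensional vectors and its built-in clustering):</p>
        <preformat>
```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_id(word_vec, centroids):
    """Assign a word to the most similar cluster centroid; the cluster
    id is then used as an extra feature of the word for the CRF."""
    sims = [cosine(word_vec, c) for c in centroids]
    return sims.index(max(sims))

centroids = [[1.0, 0.0], [0.0, 1.0]]      # toy 2-d "clusters"
print(cluster_id([0.9, 0.1], centroids))  # 0
```
        </preformat>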
      </sec>
      <sec id="sec-4-5">
        <title>4.2.2 Two-stage prediction</title>
        <p>We suppose that, for adapting a classifier to a text
collection, it can be useful to take into account the entities
already labeled by the classifier and to memorize the
named entity type statistics for future use.</p>
        <p>At the first stage, the classifier extracts named
entities. Then the system collects the class statistics
determined at the first stage for each word and uses them as
features at the second stage. After that, the new features
together with the old ones participate in the final classification.
These statistics can be collected from the current text (the
whole text or the part preceding the analyzed word) or
from a large text collection (collection statistics). In the case
of tweet processing, texts are small, therefore only the
collection statistics can be used. In our experiments, these
statistics can be obtained from the FullTweetCollection
gathered from the selected user accounts or from the labeled
TestTweetCollection, as described in Section 3.2.</p>
        <p>For each word, the system finds all mentions of this
word in the processed collection and counts the frequencies
of the named entity types determined for this word. Using
these frequencies, for each entity type the system creates an
additional feature, which has one of three values:
no_one (if the word has not been recognized as a named
entity of the chosen type), best (if the word has been
assigned to the chosen named entity type in more than 50%
of cases), and rare (if the word has been assigned to the
chosen named entity type in less than 50% of cases).</p>
        <p>For example, if the word “Russia” was met 500 times
in a collection, and the classifier assigned it 200 times to
organizations and 300 times to locations, then the values
of the global statistics feature for the word “Russia” will
be as follows: PER – no_one, ORG – rare, LOC – best.</p>
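        <p>The collection-statistics feature can be sketched as follows (a minimal illustration reproducing the “Russia” example; mapping exactly 50% of cases to rare is our assumption):</p>
        <preformat>
```python
def stage_two_feature(word, assignments, entity_type):
    """Value of the collection-statistics feature for one entity type:
    'no_one' (never assigned to the type), 'best' (assigned in more
    than 50% of mentions), 'rare' otherwise (50% or fewer is treated
    as rare here, an assumption)."""
    total = len(assignments)
    hits = assignments.count(entity_type)
    if hits == 0:
        return "no_one"
    if hits * 2 > total:
        return "best"
    return "rare"

# The paper's example: "Russia" seen 500 times, 200 as ORG, 300 as LOC
labels = ["ORG"] * 200 + ["LOC"] * 300
print(stage_two_feature("Russia", labels, "PER"))  # no_one
print(stage_two_feature("Russia", labels, "ORG"))  # rare
print(stage_two_feature("Russia", labels, "LOC"))  # best
```
        </preformat>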
      </sec>
      <sec id="sec-4-6">
        <title>4.3 Normalization of Word Capitalization</title>
        <p>Since we found that in our tweet collection the share of
misprints is not very high, we performed normalization only of
word capitalization. The normalization was based on the
large news collection described in Section 4.2.1. For each
word in this collection, we counted how many times the
word was written in lowercase or capitalized when it
does not stand at the beginning of a sentence. The more
frequent variant was considered normal for this word.</p>
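        <p>The counting procedure and the tweet-level normalization can be sketched as follows (a minimal illustration; the tie-breaking rule and the use of str.capitalize are our assumptions):</p>
        <preformat>
```python
from collections import Counter

def build_case_norms(sentences):
    """Count, for each word, how often it is capitalized vs. lowercase
    when NOT sentence-initial; the more frequent variant becomes its
    'normal' form. Ties break toward lowercase (our assumption)."""
    counts = Counter()
    for sent in sentences:
        for w in sent[1:]:  # skip the sentence-initial position
            counts[(w.lower(), w[:1].isupper())] += 1
    norms = {}
    for (word, capped), n in counts.items():
        other = counts[(word, not capped)]
        if n > other or (n == other and not capped):
            norms[word] = word.capitalize() if capped else word
    return norms

def normalize_variant_b(tokens, norms):
    """Variant B: every word, including the first, gets its normal form."""
    return [norms.get(t.lower(), t) for t in tokens]

sents = [["In", "Moscow", "it", "rains"], ["He", "left", "Moscow"]]
norms = build_case_norms(sents)
print(normalize_variant_b(["MOSCOW", "It", "RAINS"], norms))
# ['Moscow', 'it', 'rains']
```
        </preformat>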
        <p>We considered normalization in two variants:
• Variant A. All words in a tweet, except the first one, are changed to the normal form of capitalization;
• Variant B. All words in a tweet, including the first one, are changed to the normal form of capitalization.</p>
        <p>Figure 1 Tweets before normalization</p>
        <p>We found that variant B produces better results,
and later we experimented only with this variant.</p>
        <p>Fig. 1 presents several tweets with the manual
annotation before normalization. Fig. 2 shows the same
tweets after normalization.</p>
        <p>Hashtag symbols were also removed from a word
if the word was found in the news collection, to improve
its matching with the lexicons.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Experiments</title>
      <p>In preprocessing, we remove mentioned user accounts
(with “@”) and URLs at the end of tweets. We consider these
data additional meta-information, from which we
should not extract names.</p>
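      <p>This preprocessing step can be sketched with simple regular expressions (an illustration; the exact patterns used in the study are not specified):</p>
      <preformat>
```python
import re

def preprocess(tweet: str) -> str:
    """Strip @user mentions and a trailing URL from a tweet before NER;
    these are treated as meta-information, not text to extract names from."""
    tweet = re.sub(r"@\w+", "", tweet)              # mentioned accounts
    tweet = re.sub(r"https?://\S+\s*$", "", tweet)  # URL at the end
    return " ".join(tweet.split())                  # tidy whitespace

print(preprocess("@friend Allah is merciful https://t.co/abc"))
# Allah is merciful
```
      </preformat>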
      <p>We train the described variants of our NER model on
the news collection. Table 3 shows the results of named
entity recognition on the “Persons-1000” collection
(cross-validation 3:4). It can be seen that our baseline
model is quite good on the news collection and is slightly
improved by adding the clustering features and the
two-stage approach. In this case the collection statistics
are obtained from the same “Persons-1000” collection.</p>
      <p>Then we apply the trained models to the test tweet
collection with the initial capitalization and with normalized
capitalization. Table 4 presents the performance of the NER
models trained on the “Persons-1000” collection on the
tweet data. One can see that all models significantly
degrade on the tweet collection.</p>
      <p>
        The normalization significantly improves the
performance of NER (in contrast to other studies [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]).
Word clustering and the collection statistics improve
NER for both the initial and the normalized text collections.
Their impact is larger than for the news collection (Table
3). The combination of tweet and news clusters was
better than tweet clusters alone, possibly because of the
political and religious character of the gathered
collection. In total, the NER performance improves by more
than 10% on the tweet data.
      </p>
      <p>Analyzing the mistakes of the best model on the
normalized collection, we still see a significant share of
mistakes caused by incorrectly normalized
capitalization. We can enumerate the following main
subtypes of such problems:
• ambiguous words with different capitalization (“Earth”,
“Rose”);
• words that should be capitalized in this specific
collection: for example, “Paradise” and “Hell” seem to
be specific entities in this genre of texts;
• multiword expressions in which each word is usually
written in lowercase, but together the multiword
expression denotes a name and at least the first word
should be capitalized. For example, the expression
"Московский регион" (Moscow region) is normalized
incorrectly because the word "московский" is written
in lowercase more frequently in the Russian news
collection.</p>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusion</title>
      <p>The paper describes an approach to creating a
domain-specific tweet collection written by users
frequently discussing Islam-related issues in Russian.
We use this collection to study specific features of named
entity recognition on Twitter. We found that, in contrast
to randomly collected tweets, our tweet collection
contains a relatively small number of spelling errors or
unusual word shortenings. Specific difficulties of our
collection for named entity recognition include a large
number of Arabic and other Eastern names (persons,
locations, organizations) and frequent use of ALL-CAPS
writing for emphasizing main words in messages.</p>
      <p>We have studied the transfer of a NER model trained
on a newswire collection to the created tweet collection
and approaches to decrease the degradation of the model
caused by the transfer. We found that for our specialized
text collection the largest improvement came from
normalizing word capitalization. Two-stage
approaches to named entity recognition and
Word2vec-based clustering were also useful for our task.</p>
      <p>In the future, we plan to improve the techniques of tweet
normalization and to study NER for tweets of followers of
the selected users.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Antonova</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soloviev</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          :
          <article-title>Conditional Random Field Models for the Processing of Russian</article-title>
          .
          <source>In: Int. Conf. “Dialog</source>
          <year>2013</year>
          ”, pp.
          <fpage>27</fpage>
          -
          <lpage>44</lpage>
          . RGGU (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Bocharov</surname>
            ,
            <given-names>V.V.</given-names>
          </string-name>
          et al.:
          <source>“FactRuEval</source>
          <year>2016</year>
          :
          <article-title>Evaluation of Named Entity Recognition and Fact Extraction Systems for Russian”</article-title>
          . In: Dialog Conference. (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>P.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Della Pietra</surname>
            ,
            <given-names>V.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Desouza</surname>
            ,
            <given-names>P.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mercer</surname>
            ,
            <given-names>R.L.</given-names>
          </string-name>
          :
          <article-title>Class-based n-gram Models of Natural Language</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>18</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>467</fpage>
          -
          <lpage>479</lpage>
          (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Cherry</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
          </string-name>
          , H.:
          <article-title>The Unreasonable Effectiveness of Word Representations for Twitter Named Entity Recognition</article-title>
          .
          <source>In: NAACL-2015</source>
          . pp.
          <fpage>735</fpage>
          -
          <lpage>745</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Chrupala</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Efficient Induction of Probabilistic Word Classes with LDA</article-title>
          .
          <source>In: 5th Int. Joint Conf. on Natural Language Processing, IJCNLP 2011</source>
          , pp.
          <fpage>363</fpage>
          -
          <lpage>372</lpage>
          .
          <source>Asian Federation of Natural Language Processing</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Combining Distributional and Morphological Information for part of Speech Induction</article-title>
          .
          <source>In: 10th Conf</source>
          .
          <article-title>on European Chapter of the Association for Computational Linguistics</article-title>
          , EACL, 1, pp.
          <fpage>59</fpage>
          -
          <lpage>66</lpage>
          . ACL (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Derczynski</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maynard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rizzo</surname>
            , G., van Erp,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gorrell</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troncy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontcheva</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Analysis of Named Entity Recognition and Linking for Tweets</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>51</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>32</fpage>
          -
          <lpage>49</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murnane</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karandikar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keller</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martineau</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dredze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Annotating Named Entities in Twitter Data with Crowdsourcing</article-title>
          .
          <source>In: the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk</source>
          , pp.
          <fpage>80</fpage>
          -
          <lpage>88</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Fromreide</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Søgaard</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Crowdsourcing and Annotating NER for Twitter #drift</article-title>
          .
          <source>In: LREC-2014</source>
          , pp.
          <fpage>2544</fpage>
          -
          <lpage>2547</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Gareev</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tkachenko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solovyev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simanovsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ivanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Introducing Baselines for Russian Named Entity Recognition</article-title>
          .
          <source>In: 14th Int. Conf. CICLing 2013</source>
          , pp.
          <fpage>329</fpage>
          -
          <lpage>342</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Hidayatullah</surname>
            ,
            <given-names>A.F.</given-names>
          </string-name>
          :
          <article-title>Language Tweet Characteristics of Indonesian Citizens</article-title>
          .
          <source>In: Int. Conf. IEEE-2015</source>
          , pp.
          <fpage>397</fpage>
          -
          <lpage>401</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Demographic Factors Improve Classification Performance</article-title>
          .
          <source>In: ACL-2015</source>
          , pp.
          <fpage>752</fpage>
          -
          <lpage>762</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Khoroshevsky</surname>
            ,
            <given-names>V.F.</given-names>
          </string-name>
          :
          <article-title>Ontology Driven Multilingual Information Extraction and Intelligent Analytics</article-title>
          .
          <source>Web Intelligence and Security</source>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>262</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>I.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozerenko</surname>
            ,
            <given-names>E.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>K.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Timonina</surname>
            ,
            <given-names>N.O.</given-names>
          </string-name>
          :
          <article-title>Intelligent System for Entities Extraction (ISEE) from Natural Language Texts</article-title>
          .
          <source>In: Int. Workshop on Conceptual Structures for Extracting Natural Language Semantics-Sense, (9)</source>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>25</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Nadeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sekine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A Survey of Named Entity Recognition and Classification</article-title>
          .
          <source>Lingvisticae Investigationes</source>
          ,
          <volume>30</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>3</fpage>
          -
          <lpage>26</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Paris</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Differences in Language and Style Between Two Social Media Communities</article-title>
          .
          <source>In: the 6th AAAI Int. Conf. on Weblogs and Social Media</source>
          ,
          ICWSM
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Podobryaev</surname>
            ,
            <given-names>A.V.</given-names>
          </string-name>
          :
          <article-title>Persons Recognition Using CRF Model</article-title>
          .
          <source>In: 15th All-Russian Scientific Conf. “Digital Libraries: Advanced Methods and Technologies</source>
          , Digital Collections”,
          <source>RCDL-2013</source>
          , pp.
          <fpage>255</fpage>
          -
          <lpage>258</lpage>
          . Demidov Yaroslavl State University (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Ratinov</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Design Challenges and Misconceptions in Named Entity Recognition</article-title>
          .
          <source>In: 13th Conf. on Computational Natural Language Learning</source>
          , CoNLL, pp.
          <fpage>147</fpage>
          -
          <lpage>155</lpage>
          . ACL (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Ritter</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mausam</surname>
          </string-name>
          ,
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Named Entity Recognition in Tweets: An Experimental Study</article-title>
          .
          <source>In: EMNLP</source>
          , pp.
          <fpage>1524</fpage>
          -
          <lpage>1534</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Ritter</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          et al.:
          <article-title>Open Domain Event Extraction from Twitter</article-title>
          .
          <source>In: Conf. on Knowledge Discovery and Data Mining, KDD</source>
          , pp.
          <fpage>1104</fpage>
          -
          <lpage>1112</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Cherry</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>The Unreasonable Effectiveness of Word Representations for Twitter Named Entity Recognition</article-title>
          .
          <source>In: NAACL-2015</source>
          , pp.
          <fpage>735</fpage>
          -
          <lpage>745</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Trofimov</surname>
            ,
            <given-names>I.V.</given-names>
          </string-name>
          :
          <article-title>Person Name Recognition in News Articles Based on the Persons-1000/1111-F Collections</article-title>
          .
          <source>In: 16th All-Russian Scientific Conf. “Digital Libraries: Advanced Methods and Technologies</source>
          , Digital Collections”,
          <source>RCDL-2014</source>
          , pp.
          <fpage>217</fpage>
          -
          <lpage>221</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eisenstein</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Putting Things in Context: Community-specific Embedding Projections for Sentiment Analysis</article-title>
          .
          <source>arXiv preprint arXiv:1511.06052</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>