<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>User Interest Modeling in Twitter with Named Entity Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Deniz Karatay</string-name>
          <email>deniz.karatay@ceng.metu.edu.tr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pinar Karagoz</string-name>
          <email>karagoz@ceng.metu.edu.tr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>METU Computer Engineering Dept.</institution>
          ,
          <addr-line>06800 Ankara</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>1395</volume>
      <fpage>17</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>Given the wide use of Twitter as a source of information, finding an interesting tweet for a user among a large number of tweets is challenging. In this work, we propose a Named Entity Recognition (NER) based user profile model for Twitter users and employ this model to generate personalized tweet recommendations. The effectiveness of the proposed method is shown through a set of experiments.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Tweet Segmentation</kwd>
        <kwd>Tweet Classification</kwd>
        <kwd>Tweet Ranking</kwd>
        <kwd>Tweet Recommendation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>For named entities to be extracted successfully, the informal
writing style of tweets has to be handled. Named entities are
generally assumed to be words written in uppercase, or mixed-case
phrases in which uppercase letters appear at the beginning
and the end, and almost all studies are based on this
assumption. However, capitalization is not a strong indicator
in tweet-like informal texts and is sometimes even misleading. To
extract named entities from tweets, the effect of their informality
has to be minimized as far as possible. The preprocessing
tasks applied can be divided into two logical groups:
pre-segmenting and correcting. Removal of links,
hashtags, mentions, conjunctives, stop words, vocatives and slang
words, together with elimination of punctuation, is considered
pre-segmentation. It is assumed that the parts of the text before
and after a redundant word or a punctuation mark
cannot form a named entity together; therefore, every removed
word segments the tweet, just as punctuation does
naturally. Removal of repeated characters that are used to
express a feeling such as exaggeration or yelling, and handling
of mistyping and asciification-related problems, are considered
correcting, which can be thought of as converting tweets
from informal to formal text. In the following
subsections, we describe the NER, user profile modeling,
and recommendation steps in more detail.</p>
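      <p>The two preprocessing groups described above can be illustrated with a minimal Python sketch. It is only an approximation under stated assumptions: the word list REDUNDANT_WORDS, the regular expressions and the sample tweet are hypothetical, and handling of mistyping and asciification is not shown.</p>
      <preformat>
```python
import re

# Hypothetical word list standing in for the conjunctive, stop-word,
# vocative and slang lists, which the text does not enumerate.
REDUNDANT_WORDS = {"ve", "ama", "bu", "bir", "ya"}

REMOVABLE = re.compile(r"https?://\S+|[@#]\w+")   # links, mentions, hashtags
PUNCTUATION = re.compile(r"[^\w\s]+")
REPEATED = re.compile(r"([^\W\d_])\1{2,}")        # a letter repeated 3+ times


def correct(text: str) -> str:
    """Correcting: collapse exaggerated letter repetitions ("gooool" -> "gol")."""
    return REPEATED.sub(r"\1", text)


def pre_segment(tweet: str) -> list[list[str]]:
    """Pre-segmenting: every removed item or punctuation run ends a segment."""
    text = REMOVABLE.sub(" . ", tweet)            # removed items act as boundaries
    text = PUNCTUATION.sub(" | ", text)           # "|" marks a segment boundary
    segments, current = [], []
    for token in correct(text).split():
        if token == "|" or token.lower() in REDUNDANT_WORDS:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(token)
    if current:
        segments.append(current)
    return segments


sample = "Galatasaray bu aksam kazandiiii!!! @ali http://t.co/x #gs Fenerbahce ve Besiktas"
print(pre_segment(sample))
```
      </preformat>
      <p>Each inner list is a pre-segment inside which named entities are later searched; words on opposite sides of a removed item never end up in the same pre-segment.</p>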
    </sec>
    <sec id="sec-2">
      <title>2.1 Finding Named Entities</title>
      <p>
        In this study, the idea of segmenting a tweet text into a set of
phrases, each of which appears more often than random occurrence
[
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ], is adopted. Therefore, a corpus serving this purpose
in Turkish is needed. To this end, TS Corpus, which
indexes Wikipedia articles as well as tweets [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], is used. In the
proposed solution, TS Corpus is used to gather
statistical information for various segmentation combinations by
means of a dynamic programming algorithm. Unlike previous
studies, the tweet collection of TS Corpus is also used when
computing the probability that a segment is a valid named
entity. The knowledge base constructed from the Turkish
Wikipedia dump is used to validate the candidate named entities.
      </p>
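      <p>The validation of candidate segments against Wikipedia titles, with an exact match tried first and the Levenshtein distance used otherwise (as detailed later in this section), might be sketched as follows. The max_distance threshold and the tiny title set are assumptions made for illustration; the text does not state how close an inexact match must be.</p>
      <preformat>
```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def validate(segment: str, titles: set[str], max_distance: int = 1):
    """Return the matching title, or None if the segment is rejected."""
    if segment in titles:                                 # exact match
        return segment
    best = min(titles, key=lambda title: levenshtein(segment, title), default=None)
    # max_distance is a hypothetical threshold, not a value from the paper.
    if best is not None and max_distance >= levenshtein(segment, best):
        return best
    return None


titles = {"galatasaray", "ankara"}                        # toy knowledge base
print(validate("galatasary", titles))                     # inexact match
print(validate("istanbul", titles))                       # rejected
```
      </preformat>
      <p>Scanning every title is linear in the size of the knowledge base; a real system would probably restrict the comparison to a small set of candidate titles first.</p>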
      <p>Segmentation constitutes the core part of the named entity
recognition method. The aim here is to split a tweet into
consecutive segments, each containing at least one word. For
the optimal segmentation, the following objective function
is used, where F is the stickiness function, t is an individual
tweet, and s represents a segment.</p>
      <p>
        <disp-formula><label>(1)</label><tex-math>\arg\max_{s_1 \ldots s_n} F(t) = \sum_{i=1}^{n} F(s_i)</tex-math></disp-formula>
Although the term stickiness is generally used to express
the tendency of a user to stay longer on a web page,
Li et al. defined it as a measure of how frequently a group
of words is seen together in documents [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and it is
used in the same way in this study. The stickiness function
measures the stickiness of a segment, or of a tweet
representation, based on word collocations. A low stickiness
value means that the words of a segment are not commonly used
together, so the segment can be split further to obtain a more
suitable word collocation. A high stickiness value, on the other
hand, indicates that the words in the segment are often used
together and represent a word collocation, so the segment
should not be split further. To determine the correct
segmentation, the objective function above is used: the
tweet representation with the maximum stickiness is
chosen as the correct segmentation. Instead of generating
all possible segmentations and computing their stickiness, the
dynamic programming algorithm described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is adapted to
this study to compute stickiness values efficiently. The
algorithm splits a longer segment, which may be the
tweet itself, into two segments and evaluates the stickiness
of the resulting segments recursively. More formally, given
any segment s = w<sub>1</sub>w<sub>2</sub>...w<sub>n</sub>, the adjacent binary segmentation
s<sub>1</sub> = w<sub>1</sub>...w<sub>j</sub> and s<sub>2</sub> = w<sub>j+1</sub>...w<sub>n</sub> is obtained by satisfying
Equation 2.
      </p>
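      <p>As an illustration of how the objective in Equation 1 can be maximized without enumerating every segmentation, the following minimal Python sketch uses a prefix-based dynamic program. It is not necessarily the procedure of Li et al.: it tries every split point for each prefix instead of the recursive binary split, and toy_stickiness, TOY_SCORES and max_len are hypothetical stand-ins for the corpus statistics.</p>
      <preformat>
```python
from typing import Callable, Sequence

# Hypothetical stickiness values; real values would come from corpus statistics.
TOY_SCORES = {
    ("orta", "dogu", "teknik", "universitesi"): 3.0,
    ("orta", "dogu"): 1.2,
    ("ankara",): 0.8,
}


def toy_stickiness(segment: tuple) -> float:
    """Unknown single words get a small score, unknown phrases a penalty."""
    return TOY_SCORES.get(segment, 0.1 if len(segment) == 1 else -1.0)


def segment(words: Sequence[str],
            stickiness: Callable[[tuple], float],
            max_len: int = 3) -> list[tuple]:
    """Split words into consecutive segments maximizing total stickiness."""
    n = len(words)
    best = [0.0] + [float("-inf")] * n    # best[i]: best total for words[:i]
    back = [0] * (n + 1)                  # back[i]: start of the last segment
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + stickiness(tuple(words[start:end]))
            if score > best[end]:
                best[end], back[end] = score, start
    segments, end = [], n
    while end > 0:                        # follow back-pointers from the end
        segments.append(tuple(words[back[end]:end]))
        end = back[end]
    return segments[::-1]


words = "orta dogu teknik universitesi ankara".split()
print(segment(words, toy_stickiness, max_len=4))
```
      </preformat>
      <p>With max_len set to 4 the four-word phrase is kept as one segment, whereas a smaller limit forces the sketch to fall back to shorter collocations.</p>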
      <p>
        <disp-formula><label>(2)</label><tex-math>\arg\max_{s_1, s_2} F(s) = F(s_1) + F(s_2)</tex-math></disp-formula>
Thus far, tweets are segmented by means of the stickiness
function. As a result of this phase, tweet segments, which
are candidate named entities, are obtained. These candidates
have to be validated as real named entities before they can
be used as indicators of the user’s interest. For this purpose,
as explained before, Wikipedia is chosen as the reference for
deciding whether a segment is a named entity, and a graph-based
knowledge base is constructed from Wikipedia. If a segment
matches a Wikipedia title in the constructed knowledge base
exactly, it is accepted as a named entity. In case of an
inexact match, we use the Levenshtein distance [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to measure the similarity of a
segment to a Wikipedia title.
      </p>
      <p>2.2 Generating User Interest Model based on Named Entities</p>
      <p>
At this step, the named entities with their frequency counts in
the tweets obtained from the followees’ posts, and the followees’
relative rankings obtained in the data gathering phase, are processed as
shown in Figure 1. Using these data, a user interest model is
generated. It is a graph-based relationship model.
Let G = (V, E) be a weighted labelled graph with node
set V and edge set E. Nodes in V are labelled from the
label set L<sub>1</sub> = {Root, Followee, NamedEntity},
and edges in E are labelled from the label set
L<sub>2</sub> = {Follows, Writes}. In other words, a user interest
model graph has three types of nodes (Root, Followee, and Named
Entity) and two types of weighted edges (Follows and Writes).
The weight of a Writes edge represents the number of appearances
of a named entity in a followee’s posts, whereas the weight
of a Follows edge represents the relative ranking of a followee.
Therefore, a Twitter profile is represented as a Root node that
Follows one or more Followees, each of which Writes
one or more Named Entities. The structure of the graph is
shown in Figure 2.</p>
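      <p>One possible in-memory representation of this weighted labelled graph is sketched below in Python. The class name InterestGraph, its method names and the sample values are hypothetical; the sketch only mirrors the three node types and two weighted edge types described above.</p>
      <preformat>
```python
from dataclasses import dataclass, field


@dataclass
class InterestGraph:
    """Root -Follows-> Followee -Writes-> NamedEntity, both edges weighted."""
    root: str
    # Follows edges: followee -> relative ranking of that followee
    follows: dict[str, float] = field(default_factory=dict)
    # Writes edges: followee -> {named entity: appearance count}
    writes: dict[str, dict[str, int]] = field(default_factory=dict)

    def add_followee(self, followee: str, relative_ranking: float) -> None:
        self.follows[followee] = relative_ranking
        self.writes.setdefault(followee, {})

    def add_mention(self, followee: str, entity: str, count: int = 1) -> None:
        if followee not in self.follows:
            raise KeyError(f"unknown followee: {followee}")
        counts = self.writes[followee]
        counts[entity] = counts.get(entity, 0) + count


graph = InterestGraph(root="user")
graph.add_followee("news_account", relative_ranking=0.8)   # hypothetical values
graph.add_mention("news_account", "Ankara", count=2)
graph.add_mention("news_account", "Ankara")
print(graph.writes)
```
      </preformat>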
    </sec>
    <sec id="sec-3">
      <title>2.3 Tweet Recommendation</title>
      <p>Whether a tweet is interesting is determined
by comparing the NE representation of the tweet with the
generated user interest model. This comparison results in a
ranking of candidate tweets. As the first step, candidate
tweets are processed to obtain their NE representations. The NE
representation of a tweet simply consists of its NEs and their
frequency counts. To be compared with a candidate
tweet, the user interest model has to be interpreted by
including the ranking score of the followees. The appearance
counts of every followee’s named entities are first
multiplied by the followee’s ranking and then summed.
In this way, a set of named entities with their scores based on the
user interest model is obtained. The score of a single
named entity is calculated as given
in Equation 3, where SC<sub>NE</sub> represents the overall score of the
named entity, C represents the frequency count of the named
entity for a user, n represents the number of followees included
in the user interest model, RR represents the relative
ranking score of a followee, and U represents the user being modelled.
With the same approach, the final score of every named
entity appearing in the user interest model is calculated.</p>
      <p>
        <disp-formula><label>(3)</label><tex-math>SC_{NE} = \sum_{i=1}^{n} RR_i \cdot C_i + RR_U \cdot C_U</tex-math></disp-formula>
After the overall score has been calculated for all named entities
in the user interest model, the final scores of the candidate tweets
are calculated as follows. The overall score of each
named entity in the NE representation of a candidate tweet
is multiplied by its frequency count in that
representation. This is done for every named entity
in the tweet representation, and the final score of the candidate
tweet is obtained by summing these values. If a named
entity in a candidate tweet’s NE representation does not
appear in the user interest model, its overall score is taken
as 0 and it is not taken into consideration, on the assumption that
the user is not interested in the subject that the named entity
represents. Once the final scores of all candidate tweets are
calculated, the candidate tweets are sorted in descending order
of score and are thereby ranked.</p>
      <p>
        <disp-formula><label>(4)</label><tex-math>SC_T = \sum_{i=1}^{m} SC_{NE_i} \cdot C_{NE_i}</tex-math></disp-formula>
      </p>
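      <p>The scoring and ranking steps of Equations 3 and 4 can be traced with a small Python sketch. The function names, the input layout and all numeric values are hypothetical; in particular, the text does not state which ranking value is used for the user's own tweets, so user_ranking is an assumption.</p>
      <preformat>
```python
def entity_scores(followees, user_ranking=0.0, user_counts=None):
    """Equation 3: SC_NE = sum_i RR_i * C_i + RR_U * C_U, for every entity.

    followees: list of (relative_ranking, {entity: count}) pairs.
    """
    scores = {}
    for ranking, counts in followees:
        for entity, count in counts.items():
            scores[entity] = scores.get(entity, 0.0) + ranking * count
    for entity, count in (user_counts or {}).items():
        scores[entity] = scores.get(entity, 0.0) + user_ranking * count
    return scores


def tweet_score(tweet_counts, scores):
    """Equation 4: SC_T = sum_i SC_NE_i * C_NE_i; unknown entities score 0."""
    return sum(scores.get(entity, 0.0) * count
               for entity, count in tweet_counts.items())


def rank(candidates, scores):
    """Sort candidate tweet ids by descending final score."""
    return sorted(candidates,
                  key=lambda tweet_id: tweet_score(candidates[tweet_id], scores),
                  reverse=True)


# Hypothetical model: two followees plus the user's own tweets.
followees = [(0.5, {"Ankara": 2, "METU": 1}), (1.0, {"METU": 3})]
scores = entity_scores(followees, user_ranking=1.0, user_counts={"Ankara": 1})
candidates = {"t1": {"Ankara": 1}, "t2": {"METU": 2, "Izmir": 1}}
print(scores)
print(rank(candidates, scores))
```
      </preformat>
      <p>In this toy run the entity Izmir does not occur in the model and therefore contributes nothing to the score of t2, which still ranks first because of its two METU mentions.</p>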
    </sec>
    <sec id="sec-4">
      <title>3. EXPERIMENTAL RESULTS</title>
      <p>To evaluate the system from the recommendation point of view,
two datasets of candidate tweets and two groups of users
to receive recommendations are
formed. The first dataset of candidate tweets, GNRL, is
a general dataset containing 100 tweets crawled from
newspapers’ Twitter accounts. The second dataset, PSNL, is a
personal dataset containing 100 tweets crawled from
the followees of the followees of the selected users. Ten
users volunteered for this experiment; half of them
are active Twitter users and the other half are inactive
Twitter users. Active Users use
Twitter frequently, have retweeting and mentioning habits, and
update their followee lists when necessary, whereas Inactive Users
do not post, retweet, or mention often, and do not update
their followee lists frequently. The volunteers are categorized
on the basis of the information they provided about their
Twitter usage habits.</p>
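      <p>For each user, the user interest model is constructed using the SCP
measure on the Wikipedia corpus along with length
normalization for the stickiness function, which gives the best results
according to the validation experiments. In addition, the best
NT and NF values are obtained experimentally; accordingly,
20 followees and 10 tweets of each followee are included in
the model. Candidate tweets are scored by comparison with the
user’s model as explained in Section 2.3 and then ranked.
Meanwhile, each user is asked to classify and score the tweets
in the GNRL and PSNL datasets. The volunteers made a
two-step evaluation of each tweet in each dataset: they
are asked to mark the tweet as interesting or
uninteresting and, if the tweet is interesting, to
score it in the range [1, 3], where 1 is the lowest
and 3 is the highest score for interestingness. In the
baseline method, followee rankings are neglected and hence
every named entity has equal weight. The generated
recommendations are compared against the user preferences in terms
of classification and ranking.</p>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Results of the baseline method. Rows are given in two blocks of five users, one per user group (Active Users and Inactive Users), each followed by the block average; the last row is the overall average.</p>
        </caption>
        <table>
          <thead>
            <tr><th rowspan="2">User group</th><th colspan="2">Classification Acc. (%)</th><th colspan="2">Ranking Acc. (nDCG)</th></tr>
            <tr><th>GNRL</th><th>PSNL</th><th>GNRL</th><th>PSNL</th></tr>
          </thead>
          <tbody>
            <tr><td rowspan="5">Group 1</td><td>47</td><td>49</td><td>0.520</td><td>0.612</td></tr>
            <tr><td>42</td><td>39</td><td>0.573</td><td>0.654</td></tr>
            <tr><td>36</td><td>37</td><td>0.433</td><td>0.478</td></tr>
            <tr><td>43</td><td>36</td><td>0.322</td><td>0.301</td></tr>
            <tr><td>49</td><td>47</td><td>0.567</td><td>0.514</td></tr>
            <tr><td>Group 1 average</td><td>43.40</td><td>41.60</td><td>0.483</td><td>0.512</td></tr>
            <tr><td rowspan="5">Group 2</td><td>68</td><td>64</td><td>0.777</td><td>0.909</td></tr>
            <tr><td>66</td><td>61</td><td>0.699</td><td>0.768</td></tr>
            <tr><td>62</td><td>56</td><td>0.760</td><td>0.782</td></tr>
            <tr><td>71</td><td>72</td><td>0.720</td><td>0.815</td></tr>
            <tr><td>72</td><td>65</td><td>0.601</td><td>0.677</td></tr>
            <tr><td>Group 2 average</td><td>67.80</td><td>63.60</td><td>0.711</td><td>0.790</td></tr>
            <tr><td>Overall average</td><td colspan="2">54.10</td><td colspan="2">0.624</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap>
        <label>Table 2</label>
        <caption>
          <p>Classification accuracy of the proposed method. Rows are given in two blocks of five users, one per user group (Active Users and Inactive Users), each followed by the block average; the last row is the overall average.</p>
        </caption>
        <table>
          <thead>
            <tr><th rowspan="2">User group</th><th colspan="2">Classification Acc. (%)</th></tr>
            <tr><th>GNRL</th><th>PSNL</th></tr>
          </thead>
          <tbody>
            <tr><td rowspan="5">Group 1</td><td>69</td><td>66</td></tr>
            <tr><td>62</td><td>58</td></tr>
            <tr><td>52</td><td>55</td></tr>
            <tr><td>67</td><td>52</td></tr>
            <tr><td>72</td><td>69</td></tr>
            <tr><td>Group 1 average</td><td>64.40</td><td>60.00</td></tr>
            <tr><td rowspan="5">Group 2</td><td>88</td><td>86</td></tr>
            <tr><td>79</td><td>74</td></tr>
            <tr><td>74</td><td>68</td></tr>
            <tr><td>88</td><td>85</td></tr>
            <tr><td>80</td><td>77</td></tr>
            <tr><td>Group 2 average</td><td>81.80</td><td>78.00</td></tr>
            <tr><td>Overall average</td><td colspan="2">71.05</td></tr>
          </tbody>
        </table>
      </table-wrap>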
      <p>The results in Table 1 show that the baseline method decides
whether a tweet is interesting for a user
with an average classification accuracy of 54.10%
and an average ranking quality of 0.624 nDCG, both of which are
lower than the results of our system. In some cases, the performance of
the baseline method drops to 36%
correct predictions for classification and to an nDCG value of 0.322
for ranking. The results in Table 2, on the other hand,
show that the proposed system decides
whether a tweet is interesting for a user with an
average classification accuracy of 71.05% and an average
nDCG value of 0.767 for ranking. Given suitable user
habits, the performance of the system rises to 88%
correct predictions for classification and to an nDCG value of 0.958
for ranking. The comparison of the two tables shows that
the proposed user interest modeling approach increases
performance.</p>
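      <p>For readers unfamiliar with the ranking measure, a minimal Python sketch of nDCG is given below. The text does not state which nDCG variant is used, so the sketch assumes a common formulation with a log2 discount and linear gain, with a relevance of 0 for uninteresting tweets and 1 to 3 for interesting ones; the sample relevance lists are hypothetical.</p>
      <preformat>
```python
import math


def dcg(relevances):
    """Discounted cumulative gain with the common log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances_in_system_order):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances_in_system_order, reverse=True))
    return dcg(relevances_in_system_order) / ideal if ideal > 0 else 0.0


# User scores of the tweets, listed in the order the system ranked them.
print(ndcg([3, 2, 1, 0]))   # system order equals the ideal order
print(ndcg([0, 3, 1, 2]))   # an uninteresting tweet ranked first
```
      </preformat>
      <p>A value of 1 means that the system order agrees with the ordering implied by the user's scores; placing low-scored tweets near the top lowers the value.</p>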
    </sec>
    <sec id="sec-5">
      <title>4. CONCLUSIONS</title>
      <p>This paper proposes a new approach to Twitter user
modeling and tweet recommendation that makes use of named
entities extracted from tweets. A powerful aspect of the NER
approach adopted in this study, tweet segmentation, is that
it does not require a large volume of annotated training
data to extract named entities; therefore, a heavy annotation
workload is avoided. In addition, this approach does not
depend on the morphology of the language. Experimental
results show that the proposed method is capable of
deciding which tweets to recommend according to the user’s
interest, which demonstrates the applicability of the
approach for tweet recommendation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Broadhead</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <article-title>Locating complex named entities in web text</article-title>
          .
          <source>In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07</source>
          , pages
          <fpage>2733</fpage>
          -
          <lpage>2739</lpage>
          , San Francisco, CA, USA,
          <year>2007</year>
          . Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Levenshtein</surname>
          </string-name>
          .
          <article-title>Binary Codes Capable of Correcting Deletions, Insertions and Reversals</article-title>
          .
          <source>Soviet Physics Doklady</source>
          ,
          <volume>10</volume>
          :
          <fpage>707</fpage>
          ,
          <year>1966</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          .
          <article-title>Exploiting hybrid contexts for tweet segmentation</article-title>
          .
          <source>In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <source>SIGIR '13</source>
          , pages
          <fpage>523</fpage>
          -
          <lpage>532</lpage>
          , New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sun</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.-S.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Twiner: named entity recognition in targeted twitter stream</article-title>
          .
          <source>In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <source>SIGIR '12</source>
          , pages
          <fpage>721</fpage>
          -
          <lpage>730</lpage>
          , New York, NY, USA,
          <year>2012</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sezer</surname>
          </string-name>
          . TS Corpus,
          <source>The Turkish Corpus</source>
          ,
          <year>2014</year>
          . [Online; accessed 14-December-2014].
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Twitter</surname>
          </string-name>
          .
          <article-title>About Twitter, Inc.</article-title>
          ,
          <year>2014</year>
          . [Online; accessed 14-December-2014].
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>