          User Interest Modeling in Twitter with Named Entity
                             Recognition

                            Deniz Karatay                                          Pinar Karagoz
                 METU Computer Engineering Dept.                        METU Computer Engineering Dept.
                      06800 Ankara, Turkey                                   06800 Ankara, Turkey
               deniz.karatay@ceng.metu.edu.tr                             karagoz@ceng.metu.edu.tr


ABSTRACT
Considering the wide use of Twitter as a source of information, reaching an interesting tweet among a mass of tweets is challenging for a user. In this work we propose a Named Entity Recognition (NER) based user profile model for Twitter users and employ this model to generate personalized tweet recommendations. The effectiveness of the proposed method is shown through a set of experiments.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Theory

Keywords
Named Entity Recognition, Tweet Segmentation, Tweet Classification, Tweet Ranking, Tweet Recommendation

1.   INTRODUCTION
As a service that embodies both social networking and microblogging, Twitter has become one of the most important communication channels through its ability to provide the most up-to-date and newsworthy information [6]. In this study, we present a technique for constructing a user interest model in which user interests are defined by means of the relationship between the user and his friends, as well as named entities extracted from tweets. We demonstrate the use of this model for tweet recommendation.

To extract information from the large volume of tweets generated by Twitter's millions of users, Named Entity Recognition (NER), the focus of this work, is already being used by researchers. NER can be defined as identifying and categorizing certain types of data (i.e., person, location and organization names, date-time and numeric expressions) in a certain type of text. On the other hand, tweets are characteristically short and noisy. Given that tweets generally include grammar mistakes, misspellings, and informal capitalization, the performance of traditional methods on tweets is poor, and new approaches have to be developed to deal with this type of data. Recently, tweet representation based on segments has proven its validity for extracting named entities in the NER field [4, 3].

In this work, we aim to reduce the Twitter user's effort to access tweets carrying information of interest. To this aim, a tweet recommendation method built on a user interest model generated from named entities is presented. To achieve our goal, a graph-based user interest model is generated from named entities extracted from the posts of the user and the user's followees. In the user interest model, each included followee is ranked based on their interactions with the user via retweets and mentions, and named entities are scored via the ranking of the user posting them.

2.   PROPOSED METHOD
The general overview of the system architecture can be seen in Figure 1. The method used in this study segments the tweets and generates named entity candidates. These candidates have to be validated so that they can be used as an indicator of the user's interest. In this step, Wikipedia is chosen as the reference for deciding whether a segment is a named entity or not. Since our tweet collection is in Turkish, the Turkish Wikipedia dump published by Wikipedia is used.

For named entities to be extracted successfully, the informal writing style in tweets has to be handled. Named entities are generally assumed to be words written in uppercase, or mixed-case phrases with uppercase letters at the beginning and end, and almost all studies are based on this assumption. However, capitalization is not a strong indicator in tweet-like informal texts, and is sometimes even misleading. To extract named entities from tweets, the effect of their informality has to be minimized as much as possible. The preprocessing tasks applied can be divided into two logical groups: pre-segmenting and correcting. Removal of links, hashtags, mentions, conjunctives, stop words, vocatives and slang words, together with elimination of punctuation, is considered pre-segmentation. It is assumed that the parts of the text before and after a redundant word or a punctuation mark cannot form a named entity together; therefore every removal of a word is treated as segmenting the tweet, just as punctuation does naturally. Removal of repeating characters that are used to express a feeling such as exaggerating,

Copyright © 2015 held by author(s)/owner(s); copying permitted only for private and academic purposes.
Published as part of the #Microposts2015 Workshop proceedings, available online as CEUR Vol-1395 (http://ceur-ws.org/Vol-1395)
#Microposts2015, May 18th, 2015, Florence, Italy.
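The pre-segmenting and correcting steps just described can be sketched as follows. The paper does not list the exact rules or lexicons, so the regular expressions, the stop-word list, and the example words below are illustrative assumptions only.

```python
import re

# Illustrative Turkish stop words; the real system uses fuller lexicons
# (conjunctives, vocatives, slang) that the paper does not enumerate.
STOP_WORDS = {"ve", "ile", "ama"}

def pre_segment(tweet):
    """Pre-segmenting: drop links, hashtags, mentions, stop words and
    punctuation; every removal acts as a segment boundary."""
    text = re.sub(r"https?://\S+|#\w+|@\w+", "|", tweet)   # removals split
    text = re.sub(r"[.,!?;:()\"']+", "|", text)            # punctuation splits
    parts = []
    for chunk in text.split("|"):
        words = [w for w in chunk.split() if w.lower() not in STOP_WORDS]
        if words:
            parts.append(" ".join(words))
    return parts

def correct(segment):
    """Correcting: collapse characters repeated for emphasis
    (e.g. 'coook' -> 'cok')."""
    return re.sub(r"(.)\1{2,}", r"\1", segment)

print([correct(s) for s in pre_segment("maci coook begendim! http://t.co/x #mac")])
```

Each removal splits the tweet, so the output is already a coarse pre-segmentation that the stickiness-based segmenter can refine further.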




· #Microposts2015 · 5th Workshop on Making Sense of Microposts · @WWW2015
                                                     Figure 1: System Architecture


or yelling, together with handling of mistyping and asciification-related problems, is considered correcting, and can be thought of as conversion of tweets from informal to formal. In the following subsections, we describe the NER, user profile modeling, and recommendation steps in more detail.

2.1    Finding Named Entities
In this study, the idea of segmenting a tweet text into a set of phrases, each of which appears more often than by random occurrence [1, 4], is adopted. Therefore, a corpus serving this purpose in Turkish is needed. To this aim, TS Corpus, which indexes Wikipedia articles as well as tweets [5], is used. In the proposed solution, TS Corpus is used to gather statistical information for various segmentation combinations by means of a dynamic programming algorithm. While collecting statistical information for segment combinations, the tweet collection of TS Corpus is also used when computing the probability of a segment being a valid named entity, which differs from previous studies. The knowledge base constructed from the Turkish Wikipedia dump is used to validate the candidate named entities.

Segmentation constitutes the core of the named entity recognition method. The aim here is to split a tweet into consecutive segments, each containing at least one word. For the optimal segmentation, the following objective function is used, where F is the stickiness function, t is an individual tweet, and s_i represents a segment.

    arg max_{s_1 ... s_n} F(t) = \sum_{i=1}^{n} F(s_i)        (1)

Although the term stickiness is generally used to express the tendency of a user to stay longer on a web page, Li et al. defined it as a metric of whether a word group is frequently seen together in documents [4], and it is used in the same way in this study. The stickiness function basically measures the stickiness of a segment or of a tweet represented by word collocations. A low stickiness value of a segment means that its words are not commonly used together, and the segment can be further split to obtain a more suitable word collocation. On the other hand, a high stickiness value indicates that the words in the segment are often used together and represent a word collocation, so the segment cannot be further split. In order to determine the correct segmentation, the objective function above is used: the tweet representation with the maximum stickiness is chosen as the correct segmentation. Instead of generating all possible segmentations and computing their stickiness, the dynamic programming algorithm described in [4] is adapted to this study to compute stickiness values efficiently. The algorithm segments the longer segment, which can be the tweet itself, into two segments and evaluates the stickiness of the resulting segments recursively. More formally, given any segment s = w_1 w_2 ... w_n, the adjacent binary segmentation s_1 = w_1 ... w_j and s_2 = w_{j+1} ... w_n is obtained by satisfying the following equation.

    arg max_{s_1, s_2} F(s) = F(s_1) + F(s_2)        (2)

Thus far, tweets are segmented using the stickiness function. As a result of this phase, tweet segments, which are candidate named entities, are obtained. These candidates have to be validated as to whether they are real named entities, so that they can be used as an indicator of the user's interest. For this purpose, as explained before, Wikipedia is chosen as the reference for deciding whether a segment is a named entity, and a graph-based knowledge base built on Wikipedia is constructed. If the segment, i.e., a candidate named entity, matches a Wikipedia title in the constructed knowledge base exactly, then it is accepted as a named entity. In case of an inexact match, we use the
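The recursive binary segmentation of Equations 1 and 2 can be sketched as below. The real stickiness function F is derived from TS Corpus statistics; here a toy collocation table stands in for it, and the phrases and scores are invented for illustration.

```python
from functools import lru_cache

# Toy collocation scores standing in for corpus-derived stickiness.
PHRASE_SCORE = {("new", "york"): 0.9, ("new", "york", "times"): 1.5}

def stickiness(words):
    """Stand-in F: known collocations stick, everything else barely does."""
    if len(words) == 1:
        return 0.2
    return PHRASE_SCORE.get(tuple(words), 0.1)

@lru_cache(maxsize=None)
def best_segmentation(words):
    """Return (segments, score): either keep the phrase whole, or take the
    binary split s1|s2 maximizing F(s1) + F(s2), applied recursively
    (Equation 2); the top-level result maximizes Equation 1."""
    if len(words) == 1:
        return (words,), stickiness(words)
    best_segs, best_score = (words,), stickiness(words)   # keep-whole option
    for j in range(1, len(words)):
        left_segs, left_score = best_segmentation(words[:j])
        right_segs, right_score = best_segmentation(words[j:])
        if left_score + right_score > best_score:
            best_segs = left_segs + right_segs
            best_score = left_score + right_score
    return best_segs, best_score

segments, score = best_segmentation(("ankara", "new", "york"))
```

Memoizing on the word tuple turns the recursion into the dynamic program of [4]: each sub-span of the tweet is segmented only once.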




Figure 2: Structure of the User Interest Model Graph

Levenshtein distance [2] to measure the similarity of a segment to a Wikipedia title.

2.2    Generating User Interest Model based on Named Entities
At this step, named entities with their frequency counts in the tweets obtained from followees' posts, together with the followees' relative rankings obtained in the data gathering phase, are processed as shown in Figure 1. Using these data, a user interest model is generated. It is basically a graph-based relationship model. Let G = (V, E) be a weighted labelled graph with node set V and edge set E. Node set V is labelled with the label set L1, where L1 ∈ {Root, Followee, NamedEntity}, and edge set E is labelled with the label set L2, where L2 ∈ {Follows, Writes}. In other words, a user interest model graph has three types of nodes, Root, Followee and Named Entity, along with two types of weighted edges, Writes and Follows. The weight of a Writes edge represents the appearance count of a named entity in a followee's posts, whereas the weight of a Follows edge represents the relative ranking of a followee. Therefore, a Twitter profile is represented as a Root node that Follows one or many Followees, and a Followee node that Writes one or many Named Entities. The structure of the graph is shown in Figure 2.

2.3    Tweet Recommendation
Determining whether a tweet is interesting or not is achieved by comparing the NE representation of the tweet with the generated user interest model. This comparison results in a ranking of candidate tweets. As the first step, candidate tweets are processed to obtain their NE representations. The NE representation of a tweet simply includes the NEs and their frequency counts. In order to compare with a candidate tweet, the user interest model has to be interpreted by including the ranking score factor of the followees. Every followee's named entities and their appearance counts are first multiplied by the followee's ranking, and then summed. Thereby, a set of named entities with their scores based on the user interest model is obtained. The mathematical expression to calculate the score of a single named entity is given in Equation 3, where SC_NE represents the overall score of a named entity, C represents the frequency count of a named entity for a user, n represents the count of followees included in the user interest model, RR represents the relative ranking score of a followee, and U represents the user himself. With the same approach, the final score of every named entity appearing in the user interest model is calculated.

    SC_NE = \sum_{i=1}^{n} RR_i · C_i + RR_U · C_U        (3)

After the overall score is calculated for all named entities in the user interest model, final scores for candidate tweets are calculated as follows: the overall score of each named entity in the NE representation of a candidate tweet is multiplied by its frequency count in that representation. This operation is done for every named entity in the tweet representation, and by summing these values the final score of a candidate tweet, given in Equation 4, is obtained. If a named entity in a candidate tweet's NE representation does not appear in the user interest model, its overall score is taken as 0 and it is not taken into consideration, assuming the user is not interested in the subject that particular named entity represents. Once final scores for all candidate tweets are calculated, the candidate tweets are sorted in descending order, and hence they are ranked.

    SC_T = \sum_{i=1}^{m} SC_{NE_i} · C_{NE_i}        (4)

3.   EXPERIMENTAL RESULTS
To evaluate the system from the recommendation point of view, two datasets of candidate tweets and two groups of users to recommend tweets to are formed. The first dataset of candidate tweets, GNRL, is a general dataset containing 100 tweets crawled from newspapers' Twitter accounts. The second dataset, PSNL, is a personal dataset containing 100 tweets crawled from the followees of followees of the selected users. There are 10 volunteer users in this experiment; half of them are active Twitter users, whereas the other half are inactive Twitter users. Active users use Twitter frequently, have retweeting and mentioning habits, and update their followee list when necessary, whereas inactive users do not post, retweet, or mention often, and do not update their followee list frequently. The volunteer users are categorized on the basis of the information they provided about their Twitter usage habits.

For each user, the user interest model is constructed under the SCP measure on the Wikipedia corpus along with length normalization for the stickiness function, which gives the best results according to the validation experiments. In addition, the best NT and NF values are experimentally obtained; accordingly, 20 followees and 10 tweets of each followee are included in the model. Candidate tweets are scored by comparison with the user's model as explained in Section 2.3 and then ranked. Meanwhile, each user is asked to classify and score the tweets in the GNRL and PSNL datasets. The volunteer users made a two-step evaluation of each tweet in each dataset. They were asked to mark the tweet as interesting or uninteresting, and then, if the tweet was interesting, to score it in the range [1-3], where 1 is the lowest and 3 the highest interestingness score. In the
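The validation step described in Section 2.1, exact match against Wikipedia titles with a Levenshtein-distance fallback for inexact matches, can be sketched as follows; the title set and the distance threshold are illustrative assumptions, since the paper specifies neither.

```python
def levenshtein(a, b):
    """Edit distance of [2], via the classic dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Illustrative stand-in for the knowledge base of Turkish Wikipedia titles.
TITLES = {"istanbul", "galatasaray", "ankara"}

def is_named_entity(segment, max_dist=1):
    """Accept a segment on an exact title match, or on an inexact match
    within max_dist edits (threshold assumed for illustration)."""
    seg = segment.lower()
    return seg in TITLES or any(levenshtein(seg, t) <= max_dist for t in TITLES)
```

The inexact branch lets misspelled segments such as "galatasary" still be validated against the nearest title.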

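The user interest model and the scoring of Equations 3 and 4 can be sketched with plain dictionaries as below; the followee handles, rankings, and counts are invented, and the user's own rank RR_U is set to 1.0 as an assumption, since the paper does not state its value.

```python
# Followee -> relative ranking RR (weights of Follows edges), and
# followee -> {named entity -> count} (weights of Writes edges).
followee_rank = {"f1": 0.7, "f2": 0.3}
followee_counts = {"f1": {"galatasaray": 3}, "f2": {"ankara": 2}}
user_counts = {"galatasaray": 1}   # entities from the user's own posts
USER_RANK = 1.0                    # RR_U, assumed here

def score_named_entity(entity):
    """Equation 3: SC_NE = sum_i RR_i * C_i + RR_U * C_U."""
    score = sum(rank * followee_counts[f].get(entity, 0)
                for f, rank in followee_rank.items())
    return score + USER_RANK * user_counts.get(entity, 0)

def score_tweet(ne_representation, entity_scores):
    """Equation 4: SC_T = sum_i SC_NE_i * C_NE_i; entities absent
    from the user interest model score 0."""
    return sum(entity_scores.get(e, 0.0) * c
               for e, c in ne_representation.items())

# Score every named entity in the model, then rank candidate tweets.
all_entities = set(user_counts)
for counts in followee_counts.values():
    all_entities |= set(counts)
entity_scores = {e: score_named_entity(e) for e in all_entities}

candidates = {"t1": {"galatasaray": 2}, "t2": {"ankara": 1}}
ranked = sorted(candidates,
                key=lambda t: score_tweet(candidates[t], entity_scores),
                reverse=True)
```

Sorting by SC_T in descending order yields the recommendation ranking described in Section 2.3.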

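Ranking accuracy in the tables below is reported as nDCG. The paper does not spell out the exact formulation used, so the following is a standard nDCG sketch over the volunteers' interestingness scores.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(relevances_in_predicted_order):
    """Standard nDCG: DCG of the system's ordering divided by the DCG
    of the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    return dcg(relevances_in_predicted_order) / ideal if ideal > 0 else 0.0

# User scores (0 = uninteresting, 1-3 = interestingness) read off in the
# order the system ranked the tweets.
quality = ndcg([3, 0, 2, 1])
```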


                                          Classification Acc. (%)    Ranking Acc. (nDCG)
                                           GNRL          PSNL         GNRL        PSNL
                              User1         47             49         0.520       0.612
                              User2         42             39         0.573       0.654
             Inactive Users   User3         36             37         0.433       0.478
                              User4         43             36         0.322       0.301
                              User5         49             47         0.567       0.514
                 Average (IU)              43.40          41.60       0.483       0.512
                              User6         68             64         0.777       0.909
                              User7         66             61         0.699       0.768
             Active Users     User8         62             56         0.760       0.782
                              User9         71             72         0.720       0.815
                              User10        72             65         0.601       0.677
                 Average (AU)              67.80          63.60       0.711       0.790
              Average (Overall)                   54.10                     0.624


          Table 1: Tweet Recommendation Experiment Results for the Baseline Method

                                          Classification Acc. (%)    Ranking Acc. (nDCG)
                                           GNRL          PSNL         GNRL        PSNL
                              User1         69             66         0.723       0.773
                              User2         62             58         0.684       0.796
             Inactive Users   User3         52             55         0.656       0.616
                              User4         67             52         0.590       0.623
                              User5         72             69         0.734       0.691
                 Average (IU)              64.40          60.00       0.677       0.700
                              User6         88             86         0.809       0.958
                              User7         79             74         0.795       0.888
             Active Users     User8         74             68         0.812       0.826
                              User9         88             85         0.815       0.904
                              User10        80             77         0.773       0.872
                 Average (AU)              81.80          78.00       0.801       0.890
              Average (Overall)                   71.05                     0.767


          Table 2: Tweet Recommendation Experiment Results for the Proposed Method


baseline method, followee rankings are neglected, and hence every named entity has equal weight. The generated recommendations are compared against the user preferences in terms of classification and ranking.

The results in Table 1 show that the baseline method is able to decide whether a tweet is interesting for a user or not with an average accuracy of 54.10% for classification and an average nDCG value of 0.624 for ranking, both lower than the results of our system. In some cases the performance of the baseline method decreases to 36% correct prediction for classification and an nDCG value of 0.322 for ranking quality. On the other hand, the results in Table 2 show that the proposed system is able to decide whether a tweet is interesting for a user or not with an average accuracy of 71.05% for classification and an average nDCG value of 0.767 for ranking. Given suitable user habits, the performance of the system increases up to 88% correct prediction for classification and an nDCG value of 0.958 for ranking quality. The comparison of the two tables shows that the proposed user interest modeling approach increases the performance.

4.   CONCLUSIONS
This paper proposes a new approach to Twitter user modeling and tweet recommendation by making use of named entities extracted from tweets. A powerful aspect of the NER approach adopted in this study, tweet segmentation, is that it does not require a large volume of annotated training data to extract named entities; a huge annotation overhead is therefore avoided. In addition, this approach is not dependent on the morphology of the language. Experimental results show that the proposed method is capable of deciding on tweets to be recommended according to the user's interest, and demonstrate the applicability of the approach for recommending tweets.

5.   REFERENCES
[1] D. Downey, M. Broadhead, and O. Etzioni. Locating complex named entities in web text. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI '07, pages 2733–2739, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
[2] V. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, 1966.
[3] C. Li, A. Sun, J. Weng, and Q. He. Exploiting hybrid contexts for tweet segmentation. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, pages 523–532, New York, NY, USA, 2013. ACM.
[4] C. Li, J. Weng, Q. He, Y. Yao, A. Datta, A. Sun, and B.-S. Lee. TwiNER: named entity recognition in targeted Twitter stream. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '12, pages 721–730, New York, NY, USA, 2012. ACM.
[5] T. Sezer. TS Corpus, The Turkish Corpus, 2014. [Online; accessed 14-December-2014].
[6] Twitter. About Twitter, Inc., 2014. [Online; accessed 14-December-2014].



