=Paper=
{{Paper
|id=Vol-1395/paper_03
|storemode=property
|title=User Interest Modeling in Twitter with Named Entity Recognition
|pdfUrl=https://ceur-ws.org/Vol-1395/paper_03.pdf
|volume=Vol-1395
|dblpUrl=https://dblp.org/rec/conf/msm/KaratayK15
}}
==User Interest Modeling in Twitter with Named Entity Recognition==
Deniz Karatay (METU Computer Engineering Dept., 06800 Ankara, Turkey; deniz.karatay@ceng.metu.edu.tr) and Pinar Karagoz (METU Computer Engineering Dept., 06800 Ankara, Turkey; karagoz@ceng.metu.edu.tr)

ABSTRACT
Considering the wide use of Twitter as a source of information, reaching an interesting tweet among a large number of tweets is challenging for a user. In this work we propose a Named Entity Recognition (NER) based user profile model for Twitter users and employ this model to generate personalized tweet recommendations. The effectiveness of the proposed method is shown through a set of experiments.

Categories and Subject Descriptors: H.4 [Information Systems Applications]: Miscellaneous

General Terms: Theory

Keywords: Named Entity Recognition, Tweet Segmentation, Tweet Classification, Tweet Ranking, Tweet Recommendation

Copyright © 2015 held by author(s)/owner(s); copying permitted only for private and academic purposes. Published as part of the #Microposts2015 Workshop proceedings, available online as CEUR Vol-1395 (http://ceur-ws.org/Vol-1395). #Microposts2015, May 18th, 2015, Florence, Italy.

1. INTRODUCTION
As a service that embodies both social networking and microblogging, Twitter has become one of the most important communication channels with its ability to provide the most up-to-date and newsworthy information [6]. In this study, we present a technique for constructing a user interest model, in which user interests are defined by means of the relationship between the user and his friends as well as named entities extracted from tweets. We demonstrate the use of this model for tweet recommendation.

To extract information from the large volume of tweets generated by Twitter's millions of users, Named Entity Recognition (NER), which is the focus of this work, is already being used by researchers. NER can be basically defined as identifying and categorizing certain types of data (i.e. person, location and organization names, date-time and numeric expressions) in a certain type of text. On the other hand, tweets are characteristically short and noisy. Considering the fact that tweets generally include grammar mistakes, misspellings, and informal capitalization, the performance of traditional methods is inadequate on tweets, and new approaches have to be developed to deal with this type of data. Recently, tweet representation based on segments in order to extract named entities has proven its validity in the NER field [4, 3].

In this work, we aim to reduce the Twitter user's effort to access the tweets carrying the information of interest. To this aim, a tweet recommendation method built on a user interest model generated via named entities is presented. To achieve our goal, a graph-based user interest model is generated from named entities extracted from the user's followees' posts and the user's own posts. In the user interest model, each included followee is ranked based on their interactions with the user via retweets and mentions, and named entities are scored via the ranking of the user posting them.
2. PROPOSED METHOD
A general overview of the system architecture is shown in Figure 1. The method used in this study segments the tweets and generates named entity candidates. These candidates have to be validated so that they can be used as an indicator of the user's interest. In this step, Wikipedia is chosen as the reference for deciding whether a segment is a named entity or not. Since our tweet collection is in Turkish, the Turkish Wikipedia dump published by Wikipedia is used.

[Figure 1: System Architecture]

For named entities to be extracted successfully, the informal writing style in tweets has to be handled. Generally, named entities are assumed to be words written in uppercase, or mixed-case phrases where uppercase letters appear at the beginning and end, and almost all previous studies are based on this assumption. However, capitalization is not a strong indicator in tweet-like informal texts, and is sometimes even misleading. To extract named entities in tweets, the effect of the informality of the tweets has to be minimized as much as possible. The preprocessing tasks applied can be divided into two logical groups: pre-segmenting and correcting. Removal of links, hashtags, mentions, conjunctives, stop words, vocatives and slang words, together with the elimination of punctuation, is considered as pre-segmentation. It is assumed that the parts of the text before and after a removed (redundant) word or a punctuation mark cannot form a named entity together; therefore every removal of a word splits the tweet, just as punctuation does naturally. Removal of repeating characters that are used to express a feeling such as exaggeration or yelling, and handling mistyping and asciification-related problems, are considered as correcting and can be thought of as the conversion of tweets from informal to formal.
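As a rough illustration of these two preprocessing groups, the Python sketch below splits a tweet at removed tokens and punctuation (pre-segmenting) and collapses repeated characters (correcting). The regular expressions, the tiny Turkish word lists and the function names are illustrative assumptions, not the paper's actual implementation, which also handles mistyping and asciification.

```python
import re

# Illustrative word lists; the paper uses Turkish resources not reproduced here.
STOP_WORDS = {"ve", "ile", "ama"}   # conjunctives / stop words (sample)
SLANG_WORDS = {"lan", "yaa"}        # vocatives / slang (sample)

def correct(token: str) -> str:
    """'Correcting' step: collapse characters repeated for emphasis
    (e.g. 'coooook' -> 'cok'). Mistyping/asciification fixes would go here."""
    return re.sub(r"(.)\1{2,}", r"\1", token)

def pre_segment(tweet):
    """'Pre-segmenting' step: drop links, hashtags, mentions and punctuation,
    and split the tweet wherever a token is removed, since the removed token
    (or the punctuation mark) is assumed to break any named entity."""
    # Links, hashtags and mentions become segment boundaries once removed.
    tweet = re.sub(r"(https?://\S+|#\w+|@\w+)", "|", tweet)
    # Punctuation also splits the tweet naturally.
    tweet = re.sub(r"[.,!?;:\"()]+", "|", tweet)
    segments = []
    for chunk in tweet.split("|"):
        current = []
        for token in chunk.split():
            token = correct(token)
            if token.lower() in STOP_WORDS or token.lower() in SLANG_WORDS:
                if current:              # a removed word splits the segment too
                    segments.append(current)
                current = []
            else:
                current.append(token)
        if current:
            segments.append(current)
    return segments

# The removed mention and the stop word split the tweet into pre-segments.
print(pre_segment("@ahmet Galatasaray ve Fenerbahce derbisi coooook iyiydi!"))
# [['Galatasaray'], ['Fenerbahce', 'derbisi', 'cok', 'iyiydi']]
```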
In the following subsections, we describe the NER, user profile modeling and recommendation steps in more detail.

2.1 Finding Named Entities
In this study, the idea of segmenting a tweet text into a set of phrases, each of which appears more often than by random occurrence [1, 4], is adopted. Therefore, a corpus serving this purpose in Turkish is needed. To this aim, TS Corpus, which indexes Wikipedia articles and also tweets [5], is used. In the proposed solution, TS Corpus is used for gathering statistical information for various segmentation combinations by means of a dynamic programming algorithm. While collecting statistical information for segment combinations, the tweet collection of TS Corpus is also used when computing the probability of a segment being a valid named entity, which differs from previous studies. The knowledge base that is constructed using the Turkish Wikipedia dump is used to validate the candidate named entities.

Segmentation constitutes the core part of the named entity recognition method. The aim here is to split a tweet into consecutive segments, each containing at least one word. For the optimal segmentation, the following objective function is used, where F is the stickiness function, t is an individual tweet, and s_i represents a segment:

\arg\max_{s_1 \ldots s_n} F(t) = \sum_{i=1}^{n} F(s_i)    (1)

Although the term stickiness is generally used for expressing the tendency of a user to stay longer on a web page, Li et al. defined it as a metric of whether a word group is frequently seen together in documents [4], and it is used in the same way in this study. The stickiness function basically measures the stickiness of a segment or of a tweet represented by its word collocations. A low stickiness value of a segment means that its words are not commonly used together, and the segment can be further split to obtain a more suitable word collocation. On the other hand, a high stickiness value of a segment indicates that the words in the segment are often used together and represent a word collocation, and therefore the segment cannot be further split. In order to determine the correct segmentation, the objective function above is used: the tweet representation with the maximum stickiness is chosen as the correct segmentation. Instead of generating all possible segmentations and computing their stickiness, the dynamic programming algorithm described in [4] is adapted to this study to compute stickiness values efficiently. The algorithm basically splits the longer segment, which can be the tweet itself, into two segments and evaluates the stickiness of the resultant segments recursively. More formally, given any segment s = w_1 w_2 ... w_n, the adjacent binary segmentation s_1 = w_1 ... w_j and s_2 = w_{j+1} ... w_n is obtained by satisfying the following equation:

\arg\max_{s_1, s_2} F(s) = F(s_1) + F(s_2)    (2)

Thus far, tweets are segmented making use of the stickiness function. As the result of this phase, tweet segments, which are candidate named entities, are obtained. These candidate named entities have to be validated as to whether they are real named entities or not, so that they can be used as an indicator of the user's interest. For this purpose, as explained before, Wikipedia is chosen as the reference for a segment being a named entity, and a graph-based knowledge base based on Wikipedia is constructed. If the segment, which is actually a candidate named entity, matches exactly with a Wikipedia title in the constructed knowledge base, then it is accepted as a named entity. In case of an inexact match, we use the Levenshtein distance [2] to measure the similarity of a segment to a Wikipedia title.
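The sketch below illustrates this top-down binary segmentation, Eq. (1)-(2), and the title-based validation, assuming a generic `stickiness` callable in place of the TS-Corpus-based measure and a plain set of Wikipedia titles. The stopping rule (keep a segment once no binary split beats its own stickiness), the toy scores and the edit-distance threshold are assumptions made for illustration, not details given in the paper.

```python
from functools import lru_cache

def segment_tweet(tokens, stickiness):
    """Top-down binary segmentation in the spirit of Eq. (1)-(2): a segment is
    split at the adjacent binary split maximizing F(s1) + F(s2), each part is
    segmented recursively, and memoization provides the dynamic-programming
    reuse. The stopping rule is an illustrative assumption."""
    @lru_cache(maxsize=None)
    def best(seg):
        # Returns (total stickiness, segmentation) for a tuple of tokens.
        if len(seg) == 1:
            return stickiness(seg), [seg]
        split_score, (left, right) = max(
            ((best(seg[:j])[0] + best(seg[j:])[0], (seg[:j], seg[j:]))
             for j in range(1, len(seg))),
            key=lambda x: x[0],
        )
        own = stickiness(seg)
        if own >= split_score:
            return own, [seg]                    # a good collocation: keep whole
        return split_score, best(left)[1] + best(right)[1]

    return best(tuple(tokens))[1]

def levenshtein(a, b):
    """Classic edit distance [2], used for inexact matches against titles."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_named_entity(segment, wiki_titles, max_dist=1):
    """Accept a candidate segment if it equals a Wikipedia title, or is within
    a small edit distance of one (the threshold is an assumed parameter)."""
    candidate = " ".join(segment).lower()
    return candidate in wiki_titles or any(
        levenshtein(candidate, title) <= max_dist for title in wiki_titles)

# Toy stickiness: known collocations outscore the sum of their single words.
KNOWN = {("named", "entity", "recognition"): 3.5, ("new", "york"): 2.5}
toy_F = lambda seg: KNOWN.get(seg, 1.0 if len(seg) == 1 else 0.1)

segments = segment_tweet("named entity recognition in new york".split(), toy_F)
print(segments)   # [('named', 'entity', 'recognition'), ('in',), ('new', 'york')]
print([s for s in segments if is_named_entity(s, {"new york"})])  # [('new', 'york')]
```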
2.2 Generating User Interest Model based on Named Entities
At this step, the named entities with their frequency counts obtained from the followees' posts, and the followees' relative rankings obtained in the data gathering phase, are processed as shown in Figure 1. Using these data, a user interest model is generated. It is basically a graph-based relationship model. Let G = (V, E) be a weighted labelled graph with node set V and edge set E. Node set V is labelled with the label set L1, where L1 ∈ {Root, Followee, NamedEntity}, and edge set E is labelled with the label set L2, where L2 ∈ {Follows, Writes}. In other words, a user interest model graph has three types of nodes, Root, Followee and Named Entity, along with two types of weighted edges, Writes and Follows. The weight of a Writes edge represents the appearance count of a named entity in a followee's posts, whereas the weight of a Follows edge represents the relative ranking of a followee. Therefore, a Twitter profile is represented as a Root node that Follows one or many Followees, and a Followee node that Writes one or many Named Entities. The structure of the graph is shown in Figure 2.

[Figure 2: Structure of the User Interest Model Graph]
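A minimal sketch of such a model is given below, using plain dictionaries instead of an explicit graph library. The class name, the field names and the treatment of the user himself as one more ranked node are illustrative assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class UserInterestModel:
    """Root --Follows(weight = relative ranking)--> Followee
       Followee --Writes(weight = appearance count)--> NamedEntity"""
    root: str
    follows: dict = field(default_factory=dict)   # followee -> relative ranking (RR)
    writes: dict = field(default_factory=dict)    # followee -> Counter of named entities

    def add_followee(self, followee, relative_ranking):
        self.follows[followee] = relative_ranking
        self.writes.setdefault(followee, Counter())

    def add_post(self, followee, named_entities):
        """Record the NEs extracted from one of the followee's tweets."""
        self.writes[followee].update(named_entities)

# Usage with made-up handles; the user's own posts are modeled here as one more
# ranked node (an assumption; the paper scores them via the RR_U term in Eq. 3).
model = UserInterestModel(root="@user")
model.add_followee("@user", relative_ranking=1.0)
model.add_followee("@sports_news", relative_ranking=0.8)   # ranked via retweets/mentions
model.add_post("@sports_news", ["galatasaray", "fenerbahce"])
model.add_post("@user", ["galatasaray"])
print(model.writes["@sports_news"])   # Counter({'galatasaray': 1, 'fenerbahce': 1})
```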
2.3 Tweet Recommendation
Determining whether a tweet is interesting or not is achieved by comparing the NE representation of the tweet with the generated user interest model. This comparison results in a ranking of the candidate tweets. As the first step, candidate tweets are processed to obtain their NE representations. The NE representation of a tweet simply consists of its NEs and their frequency counts. In order to compare it with a candidate tweet, the user interest model has to be interpreted by including the ranking score factor of the followees. Every followee's named entities and their appearance counts are first multiplied by the followee's ranking, and then summed. Thereby, a set of named entities with their scores based on the user interest model is obtained. The mathematical formulation for calculating the score of a single named entity is given in Equation 3, where SC_NE represents the overall score of a named entity, C represents the frequency count of a named entity for a user, n represents the number of followees included in the user interest model, RR represents the relative ranking score of a followee, and U represents the user himself. With the same approach, the final score of every named entity appearing in the user interest model is calculated.

SC_{NE} = \sum_{i=1}^{n} RR_i \cdot C_i + RR_U \cdot C_U    (3)

After the overall scores are calculated for all of the named entities in the user interest model, the final scores of the candidate tweets are calculated as follows: the overall score of each named entity in the NE representation of a candidate tweet is multiplied by its frequency count in that representation. This operation is done for every named entity in the tweet representation, and the final score of the candidate tweet is obtained by summing these values, as given in Equation 4. If a named entity in a candidate tweet's NE representation does not appear in the user interest model, its overall score is taken as 0 and it is not taken into consideration, assuming the user is not interested in the subject that particular named entity represents. Once the final scores of all candidate tweets are calculated, the candidate tweets are sorted in descending order, and hence they are ranked.

SC_T = \sum_{i=1}^{m} SC_{NE_i} \cdot C_{NE_i}    (4)
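The two scoring steps translate directly into code. The following is a minimal sketch of Eq. (3) and Eq. (4) over plain dictionaries; the function and argument names, the relative ranking values and the example data are made up for illustration.

```python
from collections import Counter

def score_named_entities(followee_rank, followee_ne_counts, user_rank, user_ne_counts):
    """Eq. (3): SC_NE = sum_i RR_i * C_i + RR_U * C_U, for every named entity
    that appears in the user interest model."""
    sc_ne = Counter()
    for followee, counts in followee_ne_counts.items():
        for ne, c in counts.items():
            sc_ne[ne] += followee_rank[followee] * c
    for ne, c in user_ne_counts.items():
        sc_ne[ne] += user_rank * c
    return sc_ne

def rank_candidate_tweets(candidate_ne_reps, sc_ne):
    """Eq. (4): SC_T = sum_i SC_NE_i * C_NE_i; NEs unseen in the model score 0.
    Candidates are returned sorted by score, i.e. ranked for recommendation."""
    scored = [
        (sum(sc_ne.get(ne, 0.0) * c for ne, c in ne_rep.items()), tweet_id)
        for tweet_id, ne_rep in candidate_ne_reps.items()
    ]
    return sorted(scored, reverse=True)

# Minimal usage with made-up data.
sc_ne = score_named_entities(
    followee_rank={"@sports_news": 0.8, "@tech_blog": 0.3},
    followee_ne_counts={"@sports_news": {"galatasaray": 4}, "@tech_blog": {"android": 2}},
    user_rank=1.0,
    user_ne_counts={"galatasaray": 1},
)
candidates = {"t1": {"galatasaray": 2}, "t2": {"android": 1}, "t3": {"paris": 3}}
print(rank_candidate_tweets(candidates, sc_ne))
# [(8.4, 't1'), (0.6, 't2'), (0.0, 't3')]
```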
3. EXPERIMENTAL RESULTS
To evaluate the system from the recommendation point of view, two datasets of candidate tweets and two groups of users to recommend tweets to are formed. The first dataset of candidate tweets, GNRL, is a general dataset containing 100 tweets crawled from newspapers' Twitter accounts. The second dataset, PSNL, is a personal dataset containing 100 tweets crawled from the followees of the followees of the selected users. There are 10 users who volunteered for this experiment, where half of them are active Twitter users and the other half are inactive Twitter users. Active users are users who use Twitter frequently, have retweeting and mentioning habits, and update their followee list when necessary, whereas inactive users do not post, retweet or mention often, and do not update their followee list frequently. The volunteered users are categorized on the basis of the information they provided about their Twitter usage habits.

For each user, the user interest model is constructed using the SCP measure on the Wikipedia corpus along with length normalization for the stickiness function, which gives the best results according to the validation experiments. In addition, the best NT and NF values are obtained experimentally; therefore 20 followees and 10 tweets per followee are included in the model. Candidate tweets are scored by comparison with the user's model as explained in Section 2.3 and then ranked. Meanwhile, each user is asked to classify and score the tweets in the GNRL and PSNL datasets. The volunteered users made a two-step evaluation of each tweet in each dataset. They are asked to mark the tweet as interesting or uninteresting, and then, if the tweet is interesting, to score it in the range [1-3], where 1 is the lowest and 3 is the highest interestingness score. In the baseline method, followee rankings are neglected and hence every named entity has equal weight. The generated recommendations are compared against the user preferences in terms of classification and ranking.

{| class="wikitable"
|+ Table 1: Tweet Recommendation Experiment Results with respect to the Baseline Method
! Group !! User !! Classification Acc. (%) GNRL !! Classification Acc. (%) PSNL !! Ranking Acc. (nDCG) GNRL !! Ranking Acc. (nDCG) PSNL
|-
| Inactive Users || User1 || 47 || 49 || 0.520 || 0.612
|-
| Inactive Users || User2 || 42 || 39 || 0.573 || 0.654
|-
| Inactive Users || User3 || 36 || 37 || 0.433 || 0.478
|-
| Inactive Users || User4 || 43 || 36 || 0.322 || 0.301
|-
| Inactive Users || User5 || 49 || 47 || 0.567 || 0.514
|-
| Inactive Users || Average (IU) || 43.40 || 41.60 || 0.483 || 0.512
|-
| Active Users || User6 || 68 || 64 || 0.777 || 0.909
|-
| Active Users || User7 || 66 || 61 || 0.699 || 0.768
|-
| Active Users || User8 || 62 || 56 || 0.760 || 0.782
|-
| Active Users || User9 || 71 || 72 || 0.720 || 0.815
|-
| Active Users || User10 || 72 || 65 || 0.601 || 0.677
|-
| Active Users || Average (AU) || 67.80 || 63.60 || 0.711 || 0.790
|-
| || Average (Overall) || colspan="2" | 54.10 || colspan="2" | 0.624
|}

{| class="wikitable"
|+ Table 2: Tweet Recommendation Experiment Results with respect to the Proposed Method
! Group !! User !! Classification Acc. (%) GNRL !! Classification Acc. (%) PSNL !! Ranking Acc. (nDCG) GNRL !! Ranking Acc. (nDCG) PSNL
|-
| Inactive Users || User1 || 69 || 66 || 0.723 || 0.773
|-
| Inactive Users || User2 || 62 || 58 || 0.684 || 0.796
|-
| Inactive Users || User3 || 52 || 55 || 0.656 || 0.616
|-
| Inactive Users || User4 || 67 || 52 || 0.590 || 0.623
|-
| Inactive Users || User5 || 72 || 69 || 0.734 || 0.691
|-
| Inactive Users || Average (IU) || 64.40 || 60.00 || 0.677 || 0.700
|-
| Active Users || User6 || 88 || 86 || 0.809 || 0.958
|-
| Active Users || User7 || 79 || 74 || 0.795 || 0.888
|-
| Active Users || User8 || 74 || 68 || 0.812 || 0.826
|-
| Active Users || User9 || 88 || 85 || 0.815 || 0.904
|-
| Active Users || User10 || 80 || 77 || 0.773 || 0.872
|-
| Active Users || Average (AU) || 81.80 || 78.00 || 0.801 || 0.890
|-
| || Average (Overall) || colspan="2" | 71.05 || colspan="2" | 0.767
|}

The results in Table 1 show that the baseline method is able to decide whether a tweet is interesting for a user or not with an average accuracy of 54.10% for classification and an average nDCG value of 0.624 for ranking, which are lower than the results of our system. In some cases the performance of the baseline method decreases down to 36% correct predictions for classification and a 0.322 nDCG value for ranking quality. On the other hand, the results shown in Table 2 show that the proposed system is able to decide whether a tweet is interesting for a user or not with an average accuracy of 71.05% for classification and an average nDCG value of 0.767 for ranking. Given suitable user habits, the performance of the system increases up to 88% correct predictions for classification and a 0.958 nDCG value for ranking quality. The comparison of the two tables shows that the proposed user interest modeling approach increases the performance.

4. CONCLUSIONS
This paper proposes a new approach to Twitter user modeling and tweet recommendation by making use of named entities extracted from tweets. A powerful aspect of the NER approach adopted in this study, tweet segmentation, is that it does not require a large volume of annotated training data to extract named entities, therefore a huge annotation overhead is avoided. In addition, this approach is not dependent on the morphology of the language. Experimental results show that the proposed method is capable of deciding on tweets to be recommended according to the user's interest, and demonstrate the applicability of the approach for recommending tweets.

5. REFERENCES
[1] D. Downey, M. Broadhead, and O. Etzioni. Locating complex named entities in web text. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 2733-2739, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
[2] V. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, 1966.
[3] C. Li, A. Sun, J. Weng, and Q. He. Exploiting hybrid contexts for tweet segmentation. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, pages 523-532, New York, NY, USA, 2013. ACM.
[4] C. Li, J. Weng, Q. He, Y. Yao, A. Datta, A. Sun, and B.-S. Lee. TwiNER: named entity recognition in targeted Twitter stream. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '12, pages 721-730, New York, NY, USA, 2012. ACM.
[5] T. Sezer. TS Corpus, The Turkish Corpus, 2014. [Online; accessed 14-December-2014].
[6] Twitter. About Twitter, Inc., 2014. [Online; accessed 14-December-2014].