=Paper=
{{Paper
|id=Vol-2036/T6-4
|storemode=property
|title=Microblog Processing: A Study
|pdfUrl=https://ceur-ws.org/Vol-2036/T6-4.pdf
|volume=Vol-2036
|authors=Sandip Modha
|dblpUrl=https://dblp.org/rec/conf/fire/Modha17
}}
==Microblog Processing: A Study==
Sandip Modha
Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, Gujarat, India
sjmodha@gmail.com

ABSTRACT
Microblog retrieval and summarization have become challenging areas for the information retrieval community. Twitter is one of the most popular microblogging platforms. In this paper, Twitter posts, called tweets, are studied from retrieval and extractive-summarization perspectives. Given a set of topics, or interest profiles, describing information requirements, a microblog summarization system is designed that processes the Twitter sample status stream and generates a day-wise, topic-wise tweet summary. Since the volume of the Twitter public status stream is very large, tweet filtering, i.e. relevant tweet retrieval, is the primary task for the summarization system. To measure the relevance between tweets and interest profiles, language models with Jelinek-Mercer (JM) smoothing and Dirichlet smoothing, as well as the Okapi BM25 model, are used. The behaviour of the language-model smoothing parameters, λ for JM smoothing and µ for Dirichlet smoothing, is also studied. Summarization is cast as a clustering problem. The TREC MB 2015 and TREC RTS 2016 datasets are used for the experiments, and the official TREC RTS metrics nDCG@10-1 and nDCG@10-0 are used to evaluate the outcome. A detailed post hoc analysis of the experimental results is also performed.

KEYWORDS
Microblog, Summarization, Ranking, Language Model, JM Smoothing, Dirichlet Smoothing

1 MOTIVATION AND CHALLENGES
Microblogs have become a popular social medium to disseminate or broadcast real-world events and opinions about events of any nature. As of 2016, Twitter had 319 million active users across the world (https://en.wikipedia.org/wiki/Twitter). With this large user base, Twitter is an interesting data source for real-time information. On many occasions, Twitter has been the first medium to break an event, and thousands of users across the world interact on the same topic, or interest profile, with diverse views.

The following are the major challenges for microblog summarization. Henceforth, "topic" and "interest profile" are used interchangeably in the rest of the paper.

i) Since Twitter imposes a limit on the length of a tweet, it becomes very difficult for a retrieval system to retrieve tweets without proper context. Tweet sparseness is therefore a critical issue for the retrieval system.
ii) On many topics, the volume of tweets is very large, and most tweets are redundant and noisy.
iii) On Twitter, some topics are discussed over a long period of time and diverge into many subtopics (e.g. demonetization in India, refugees in Europe). It is very difficult to track topic drift for an event; to do so, one has to update the query vector by expanding or shrinking the query terms.
iv) Tweets often include abbreviations (e.g. "lol", or India written as "ind"), smileys, special characters, and misspellings ("tomorrow" written as "2moro"). Tweet normalization is a major issue for microblog processing.
v) On many occasions, native-language tweets are written in transliterated, romanized English.

There are two settings for microblog summarization [2][7]: (I) online summarization, or push notification, where novel tweets are sent to the user in real time and latency matters, i.e. how fast relevant and novel tweets can be delivered to interested users; and (II) offline summarization (email digest), where at the end of the day the system generates a topic-wise summary of novel and relevant tweets, essentially summarizing what happened that day. In offline summarization, latency is not important. In this paper, the latter case is considered for the experiments.

A summarization system should include relevant and novel tweets in the summary. If there are no relevant tweets for a particular interest profile on a specific day, that day is called a silent day for the profile, and the summarization system should not include any tweet for it. If the system correctly identifies such a silent day, it is awarded the highest score (i.e. 1); if it includes a tweet in the summary for an interest profile on a silent day, it receives a score of 0.

2 RELATED WORKS
Lin and Diaz [2][3] introduced the Microblog track in 2012 with the objective of exploring new IR methodology on short text. Mossaab et al. [1] trained a Word2vec model on a four-year tweet corpus and used the Okapi BM25 relevance model to calculate relevance scores; to refine the scores of relevant tweets, they re-scored tweets with the SVMrank package using the relevance scores of the previous stage. Luchen et al. [7] expanded the title terms each day with point-wise KL divergence to extract 5 hashtags and 10 other terms; for the relevance score, they used a unigram matching formula with different weights for original title terms and expanded terms. Our approach is similar to [6], but we empirically tune the smoothing parameters for better results. In addition, we incorporate two levels of thresholds, computed via grid search, that control which tweets become part of the daily summary.

3 DATA AND RESOURCES
TREC has run the Microblog track since 2012. In 2016 the track was merged with temporal summarization and renamed the Real-Time Summarization (RTS) track [2]. Experiments are performed on the TREC RTS 2016 dataset [5] and the TREC MB 2015 dataset [3] to evaluate system performance. Table 1 describes the statistics of both datasets.
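The normalization issue described in challenge iv) can be sketched as a simple dictionary-plus-regex pass. This is an illustrative sketch only, not the paper's implementation; the abbreviation table below is a hypothetical stand-in for a curated resource.

```python
import re

# Hypothetical abbreviation table; a real system would use a curated dictionary.
ABBREVIATIONS = {"2moro": "tomorrow", "lol": "laughing out loud", "ind": "india"}

def normalize_tweet(text: str) -> str:
    """Lower-case, drop non-ASCII symbols (smileys etc.), strip punctuation
    except hashtags/mentions, and expand known abbreviations."""
    text = text.lower()
    text = text.encode("ascii", "ignore").decode()   # remove smileys/special chars
    text = re.sub(r"[^a-z0-9#@\s]", " ", text)       # keep hashtags and mentions
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split())

print(normalize_tweet("See you 2moro in Ind!"))  # see you tomorrow in india
```

A production normalizer would also need spell correction and transliteration handling for challenge v), which this sketch does not attempt.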
Table 1: TREC RTS 2016 and TREC MB 2015 dataset description

  Dataset detail                              TREC RTS 2016             TREC MB 2015
  Total number of tweets                      13 Mn                     42 Mn
  Interest profiles for evaluation            56                        51
  Size of qrels                               67,525                    94,066
  Number of positive qrels                    3,339                     8,233
  Interest profiles common to both datasets   11                        11
  Tweet download duration                     02-08-2016 to 11-08-2016  20-07-2015 to 29-07-2015

4 PROBLEM STATEMENT
Given interest profiles IP = {IP_1, IP_2, ..., IP_m} and tweets T = {T_1, T_2, ..., T_n} from the dataset, we need to compute the relevance score between tweets and interest profiles in order to create a profile-wise offline summary S = {S_1, ..., S_m}, where S_i is the set of day-wise relevant and novel tweets for the i-th profile. A profile-specific summary is modelled as

  S_i = {t_1, t_2, ..., t_k}, where t_i, t_j ∈ T.

For a given interest profile, the relevance score between a tweet and the profile must exceed the specified silent-day threshold T_s and relevance threshold T_r. In addition, the tweets should be novel, i.e. the similarity between any pair of tweets in the summary should be less than the novelty threshold T_n. The ranking function is defined as

  F(IP, T) = P(IP | T, R = 1),

which describes how likely the interest profile is IP given that the tweet is relevant. The term P(IP | T) is estimated by a language model. Any tweet t_i included in the summary for a particular profile on a given day should satisfy the following constraints:

• the day-wise summary of an interest profile contains up to 100 tweets;
• Sim(t_i, t_j) ≤ T_n for all t_j ∈ S_i (T_n = novelty threshold).

There are two types of days, namely silent days and eventful days. An eventful day is one on which there are some relevant tweets for the given interest profile; a silent day is one on which there is no relevant tweet for it, and the system should not include any tweet in that day's summary for that profile. On a silent day, the system receives a score of one (the highest score) if it does not include any tweet in the summary for that interest profile, and zero otherwise. Detecting silent days for a profile is therefore a critical task for the summarization system.

5 PROPOSED METHODOLOGY
In this section, we describe our proposed approach to designing a microblog summarization system.

5.1 Query Formulation from Interest Profiles
Interest profiles consist of a 3-4 word title, a sentence-long description, and a paragraph-length narrative explaining the detailed information need [2]. All terms from the title field, together with named entities from the description and narrative fields, are extracted to generate the query. A dictionary is maintained to map named entities to their abbreviated forms.

5.2 Tweet Pre-processing
Tweets and interest profiles are pre-processed before the relevance score is calculated. Non-English tweets are filtered out using the language attribute of the tweet object, and non-ASCII characters are removed. For tweets with an external URL embedded in the text, the URL is expanded and the text of the external page is merged with the tweet text. Tweets with fewer than 5 tokens and no external URL are filtered out.

5.3 Relevance Score
To retrieve relevant tweets for a given interest profile, we implement language models with Jelinek-Mercer smoothing and Dirichlet smoothing, with parameters λ and µ respectively. In addition, we use the Okapi BM25 model to rank tweets.

5.4 Summarization Method
To select the top relevant and novel tweets, we designed a two-level threshold mechanism. At the first level, for any interest profile on any day, if all the tweets ranked under the profile have scores below the silent-day threshold T_s, we treat the day as a silent day and include no tweet in the profile's summary; T_s is set empirically using grid search. Otherwise, we normalize the tweet scores, assigning the value 1 to the highest-scoring tweet and relative values in the range 0 to 1 to the others. At the second level, we add to the candidate list every tweet whose actual score exceeds T_r1 and whose normalized score exceeds T_r2, and extract the top k tweets. The second-level relevance thresholds T_r1 and T_r2 are also selected empirically using grid search.

5.4.1 Novelty Detection Using Tweet Clusters. In this study, the microblog (tweet) summarization problem is cast as a tweet clustering problem. Once all the relevant tweets are retrieved, clusters are formed using the Jaccard similarity of the tweets' text. Tweets containing an external URL or a temporal expression are given priority, because such tweets are more informative than text-only tweets. We use regular expressions to extract temporal expressions from tweet text.
6 RESULTS
To evaluate the performance of the system, the normalized discounted cumulative gain at rank 10, nDCG@10, is computed for each day and each interest profile and averaged across them [2]. There are two variants, nDCG@10-1 and nDCG@10-0. Under nDCG@10-1 [8], on a silent day the system receives a score of 1 if it does not include any tweet in the summary for the particular interest profile, and 0 otherwise. Under nDCG@10-0, on a silent day the system receives zero gain irrespective of what is produced [2]. Our goal is to maximize nDCG@10-1 and nDCG@10-0 jointly, which gives a wider picture, by tuning λ and T_s in the case of the language model with JM smoothing, and µ and T_s in the case of Dirichlet smoothing.

Table 2: Results on TREC RTS 2016 with different ranking functions, using grid search

  Ranking function                           nDCG@10-1   nDCG@10-0
  Language model with JM smoothing           0.3317      0.0998
  Language model with Dirichlet smoothing    0.3384      0.1116
  Okapi BM25                                 0.3524      0.1131

Table 3: Result comparison with the top TREC RTS 2016 teams

  Metric       Our run   COMP2016   QU       Blank run
  nDCG@10-1    0.3524    0.2898     0.2621   0.2339
  nDCG@10-0    0.1131    0.0684     0.030    0

Table 4: Results on TREC MB 2015

  Team                                  nDCG@10
  Our result (LM with JM smoothing)     0.2676
  NUDTSNA                               0.3670
  CLIP CMU                              0.2492

While analyzing nDCG@10-1 and nDCG@10-0 on TREC RTS 2016 [2][8], our system failed on some interest profiles, such as RTS37 (Sea World), MB265 (cruise ship mishaps), and MB365 (cellphone tracking): we detected some of the silent days and obtained some score on nDCG@10-1, but did not score on nDCG@10-0. This is why we look at both metrics when evaluating our system. The TREC RTS 2016 organizers [5] used nDCG@10-1, which adds gain on silent as well as eventful days, as the primary metric to rank teams. However, nDCG@10-0, which reflects how many relevant and novel tweets are part of the daily summary and adds no gain on silent days, is also very important. Our analysis of the TREC RTS 2016 results [5] shows that an empty run, i.e. a blank file with zero tweets, scored nDCG@10-1 = 0.2339, which is more than the average score of all the teams, so nDCG@10-1 alone is not a very accurate measure for judging systems. The COMP2016 team [4] scored nDCG@10-1 = 0.2898 and nDCG@10-0 = 0.0684, which shows that 76 percent of its nDCG@10-1 score was obtained by remaining silent. In this experiment, we tuned the parameters to maximize nDCG@10-1 and nDCG@10-0 jointly; we believe nDCG@10-0 is a very important metric because it indicates how many relevant and novel tweets were included in the summary. Our best result, nDCG@10-1 = 0.3524 and nDCG@10-0 = 0.1131, obtained without any sort of query expansion, substantially outperforms the top team [4] of TREC RTS 2016 [2]. The improvement in nDCG@10-0 shows that we added more relevant tweets to the interest-profile summaries.

Table 2 shows the system results with the standard ranking functions. The results show that all the ranking functions perform in line with one another, though the Okapi BM25 model marginally outperforms the language models. Our language-model results with Dirichlet smoothing and JM smoothing outperform the results reported by [6]; the factor behind this is our choice of λ = 0.1 and µ = 1000 together with the two-level threshold mechanism, whereas Suwaileh et al. [6] set λ = 0.7 and µ = 2000. Table 3 shows a 25 percent improvement over the results reported by the top team of TREC RTS 2016 [4][5]. Table 4 shows the system results on the TREC MB 2015 dataset [3]; here the thresholds were decided empirically, not through grid search.

7 POST HOC ANALYSIS
In this section, we discuss a comprehensive performance analysis of the summarization system from various perspectives. Since a massive dataset is used in the experiment, tweet selection, or tweet filtering, is the primary task of the summarization system, and since Twitter restricts the length of tweets, tweet sparseness is the biggest challenge for relevant tweet retrieval.

Interest profiles consist of a 3-4 word title, a sentence-long description, and a paragraph-length narrative explaining the detailed information need [2]. The crucial part is how to generate a query from this triplet. Luchen et al. [7] reported that title keywords play a critical role in retrieval; our experiments also support this finding.

The objective of the summarization system is to identify all the clusters formed across the given period for all the interest profiles, and to include no tweet when the given day is silent for an interest profile. The performance of the summarization system depends on two tasks: (i) relevant tweet retrieval and (ii) novelty detection across the relevant tweets.

7.1 Interest Profile Characteristics
During post hoc analysis, it was observed that interest profiles have different characteristics. Some interest profiles carry a spatial restriction, for example "bus service to NYC", "gay marriage laws in Europe", "job training for high school graduates US". For other interest profiles no spatial restriction applies and user interest is spread across the globe, e.g. "emerging music styles", "adult summer camp", "hidden icons in movies and television".

Generalized interest profiles have many silent days, while interest profiles with a spatial named entity have more relevant tweets. Named entities play a very crucial role in relevant tweet retrieval. Some interest-profile titles do not include a named entity, so we extracted named entities from the narrative field and included them in the query. Interest profiles, i.e. queries, without a named entity as a query term perform very badly on the result metrics, e.g. "emerging music styles".

7.2 Named Entity Linking Problem
Interest profiles sometimes contain a very general named entity, e.g. "legalizing medical marijuana US", while a matching tweet contains the named entity Florida ("Florida Medical Association to oppose medical marijuana ballot amendment in Florida"). Due to this named-entity linking problem, relevant tweets score low against the interest profile.

7.3 Named Entity Normalization
Due to the limit on tweet length, microblog users often write named entities in abbreviated form, e.g. DEA (Drug Enforcement Agency). Though the query contains the term "Drug Enforcement Agency", we cannot retrieve tweets that use only the abbreviated form unless the named entity is normalized.

7.4 Clustering Issues
Since tweet summarization is a multi-document summarization problem, each tweet, along with its external URL, is considered one document. Since Twitter is a crowdsourcing platform, many users report the same event with different facts, so our novelty-detection algorithm fails to place all of the following tweets in the same cluster:

T1: Woman Is Eaten Alive By A Tiger At A Safari Park
T2: Woman attacked by a tiger when she gets out of her car in a safari
T3: Horror at Beijing Safari World as tigers attack women who exited car, killing one, injuring another

7.5 Inclusion of Conditional Events in Interest Profiles
For interest profiles like "cancer and depression", our system performs very badly. Here the user is looking for patients suffering from depression after being diagnosed with cancer, and it is very difficult to judge the co-occurrence of both events in a tweet.

7.6 Inclusion of Sentiment in Interest Profiles
Some interest profiles, like "Restaurant Week NYC", include sentiment, opinion, or recommendation. Tweets that match the profile but lack the sentiment perspective are marked as non-relevant. In future work, we have to exploit hidden features like sentiment to increase the scores of such low-scoring tweets.

7.7 Hashtag Identification
Hashtags can be one of the features for relevant tweet identification. Identifying relevant hashtags will increase the scores of relevant tweets: e.g. for the keyword "sea world" the relevant hashtag is #seaworld, and for "self driving car" it is #selfdrivingcar.

7.8 Effect of Query Expansion
It has been observed that for interest profiles without a proper named entity, our system performs very badly on the evaluation metrics nDCG@10-1 and nDCG@10-0 in the majority of cases. We hypothesize that query expansion might work positively for these interest profiles; our results show that query expansion for such topics improves both nDCG@10-1 and nDCG@10-0. Query expansion can be applied across all interest profiles or on a case-by-case basis.

8 CONCLUSION
In this paper, we presented a summarization system using language models with JM smoothing and Dirichlet smoothing and the Okapi BM25 model. The results show that all the ranking functions perform in line with one another, though the Okapi BM25 model marginally outperforms the language models. We performed a grid search to determine the optimal silent-day threshold T_s and relevance threshold T_r. We also identified the smoothing parameters λ = 0.1 for the language model with JM smoothing and µ = 1500 for Dirichlet smoothing as giving better results, and showed that by effectively choosing λ and µ we can outperform the results obtained by [6].

9 CURRENT WORK
The TREC RTS metrics give more emphasis to precision than recall. Query expansion may include non-relevant tweets in the summary; it improves recall, but precision decreases substantially, with an adverse effect on the results. The relevance thresholds are very critical to the summarization system's selection of tweets for the day-wise, topic-wise summary. After careful analysis of the TREC MB 2015 and TREC RTS 2016 datasets, we found that non-relevant tweets score higher than relevant tweets on many occasions. This motivates designing a machine-learning technique or deep neural network to estimate the silent-day threshold T_s and relevance threshold T_r. As of now, we are working with the following hypotheses.

H1: We can predict the thresholds for a new dataset (TREC RTS 2016) from an old dataset (TREC MB 2015).

Some of the interest profiles are common to both datasets. Based on this fact, we designed the following hypothesis.

H2: Irrespective of whether the topics are the same or different, statistical features of the ranked list can be exploited to predict the silent-day threshold T_s and relevance threshold T_r.

As of now, I am working on a machine-learning model for estimating these thresholds for any dataset downloaded from Twitter.

REFERENCES
[1] Mossaab Bagdouri and Douglas W. Oard. 2015. CLIP at TREC 2015: Microblog and LiveQA. In TREC.
[2] Jimmy Lin, Adam Roegiest, Luchen Tan, Richard McCreadie, Ellen Voorhees, and Fernando Diaz. [n. d.]. TREC RTS 2016 Guidelines. http://trecrts.github.io
[3] Jimmy Lin, Miles Efron, Yulu Wang, Garrick Sherman, and Ellen Voorhees. [n. d.]. TREC 2015 Microblog Track: Real-Time Filtering Task Guidelines. https://github.com/lintool/twitter-tools/wiki/TREC-2015-Track-Guidelines
[4] Haihui Tan, Dajun Luo, and Wenjie Li. [n. d.]. PolyU at TREC 2016 Real-Time Summarization.
[5] Jimmy Lin, Adam Roegiest, Luchen Tan, Richard McCreadie, Ellen Voorhees, and Fernando Diaz. 2016. Overview of the TREC 2016 real-time summarization track. In Proceedings of the 25th Text REtrieval Conference, TREC, Vol. 16.
[6] Reem Suwaileh, Maram Hasanain, and Tamer Elsayed. 2016. Light-weight, Conservative, yet Effective: Scalable Real-time Tweet Summarization. In TREC.
[7] Luchen Tan, Adam Roegiest, Charles L. A. Clarke, and Jimmy Lin. 2016. Simple dynamic emission strategies for microblog filtering. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1009-1012.
[8] Luchen Tan, Adam Roegiest, Jimmy Lin, and Charles L. A. Clarke. 2016. An exploration of evaluation metrics for mobile push notifications. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 741-744.