                                          Microblog Processing: A Study
                                                               Sandip Modha
                                  Dhirubhai Ambani Institute of Information and Communication Technology
                                                         Gandhinagar, Gujarat, India
                                                           sjmodha@gmail.com

ABSTRACT
Microblog retrieval and summarization have become challenging areas for the information retrieval community. Twitter is one of the most popular microblogging platforms. In this paper, Twitter posts, called tweets, are studied from the retrieval and extractive summarization perspectives. Given a set of topics, also called interest profiles or information needs, a microblog summarization system is designed which processes the Twitter sample status stream and generates a day-wise, topic-wise tweet summary. Since the volume of the Twitter public status stream is very large, tweet filtering, i.e. relevant tweet retrieval, is the primary task for the summarization system. To measure the relevance between tweets and interest profiles, a language model with Jelinek-Mercer smoothing, a language model with Dirichlet smoothing, and the Okapi BM25 model are used. The behaviour of the language model smoothing parameters, λ for JM smoothing and µ for Dirichlet smoothing, is also studied. Summarization is treated as a clustering problem. The TREC MB 2015 and TREC RTS 2016 datasets are used to perform the experiments. The official TREC RTS metrics nDCG@10-1 and nDCG@10-0 are used to evaluate the outcome of the experiments. A detailed post hoc analysis is also performed on the experimental results.

KEYWORDS
Microblog, Summarization, Ranking, Language Model, JM Smoothing, Dirichlet Smoothing

1 MOTIVATION AND CHALLENGES
Microblogs have become a popular social medium to disseminate or broadcast real-world events, or opinions about events of any nature. As of 2016, Twitter had 319 million active users across the world¹. With this large user base, Twitter is an interesting data source for real-time information. On many occasions it has been observed that Twitter was the first medium to break an event. Many times, thousands of users across the world interact on the same topic or interest profile with diverse views. Henceforth, the terms topic and interest profile are used interchangeably in the rest of the paper. The major challenges for microblog summarization are the following:

  i) Since Twitter imposes a limit on the length of a tweet, it becomes very difficult for a retrieval system to retrieve tweets without proper context. Tweet sparseness is therefore a critical issue for the retrieval system.
 ii) On many topics, the volume of tweets is very large. Most of the tweets are redundant and noisy.
iii) On Twitter, some topics are discussed over a longer period of time. They also drift into many subtopics (e.g. demonetization in India, refugees in Europe). It is very difficult to track topic drift for an event. To track topic drift, one has to update the query vector by expanding or shrinking the query terms.
 iv) Tweets often include abbreviations (e.g. lol, or India written as ind), smileys, special characters and misspellings (tomorrow written as 2moro). Tweet normalization is one of the biggest issues for microblog processing.
  v) On many occasions it has been found that native-language tweets are written in transliterated, romanized English script.

¹ https://en.wikipedia.org/wiki/Twitter

There are two settings for microblog summarization [2] [7]: (I) online summarization, or push notification, where novel tweets are sent to the user in real time and latency is important, i.e. how fast relevant and novel tweets can be delivered to the interested user; and (II) offline summarization (email digest), where at the end of the day the system generates a topic-wise, novel and relevant tweet summary which essentially summarizes what happened that day. In offline summarization, latency is not important. In this paper, the latter setting is considered for the experiments.

The summarization system should include only relevant and novel tweets in the summary. If there are no relevant tweets for a particular interest profile on a specific day, then that day is called a silent day for that interest profile, and the summarization system should not include any tweet for that profile. If the system correctly identifies such a silent day, it is awarded the highest score (i.e. 1). If the system includes a tweet in the summary for an interest profile on a silent day, it receives a score of 0.

2 RELATED WORKS
Lin and Diaz [2] [3] introduced the Microblog track in 2012 with the objective of exploring new IR methodologies on short text. Bagdouri et al. [1] trained a Word2vec model on a corpus of four years of tweets. They used the Okapi BM25 relevance model to calculate the relevance score. To refine the scores of the relevant tweets, tweets were re-scored with the SVM-rank package using the relevance score of the previous stage. Tan et al. [7] expanded the title terms each day with point-wise KL-divergence to extract 5 hashtags and 10 other terms. For the relevance score, they used a unigram matching formula with different weights for original title terms and expanded terms. Our approach is similar to [6], but we have empirically tuned the smoothing parameters for better results. In addition, we have incorporated two levels of thresholds, computed via grid search, which control which tweets become part of the daily summary.

3 DATA AND RESOURCES
TREC has run the Microblog track since 2012. In 2016 the track was merged with temporal summarization and renamed the Real-Time Summarization (RTS) track [2]. Experiments are performed on the TREC RTS 2016 dataset [5] and the TREC MB 2015 dataset [3] to evaluate system performance. Table 1 describes the statistics of both datasets.
                                      Table 1: TREC RTS 2016 and TREC MB 2015 dataset description

    Dataset detail                                         TREC RTS 2016               TREC MB 2015
    Total number of tweets                                 13 million                  42 million
    Interest profiles for evaluation                       56                          51
    Size of qrels                                          67,525                      94,066
    Number of positive qrels                               3,339                       8,233
    Number of interest profiles common to both datasets    11                          11
    Tweet download period                                  02-08-2016 to 11-08-2016    20-07-2015 to 29-07-2015


4 PROBLEM STATEMENT
Given a set of interest profiles IP = {IP1, IP2, ..., IPm} and tweets T = {T1, T2, ..., Tn} from the dataset, we need to compute the relevance score between tweets and interest profiles in order to create a profile-wise offline summary S = {S1, ..., Sm}, where Si is the set of day-wise relevant and novel tweets for the i-th profile. We can model a profile-specific summary as

   Si = {t1, t2, ..., tk}, where each tj ∈ T.

For a given interest profile, the relevance score between a tweet and the interest profile must be greater than the specified silent-day threshold Ts and the relevance threshold Tr. In addition, these tweets should be novel, i.e. the similarity between any two tweets of the summary should be less than the novelty threshold Tn. If a tweet ti is included in the summary for a particular profile on a given day, it should satisfy the following constraints:
   • the day-wise summary of an interest profile contains at most 100 tweets;
   • Sim(ti, tj) ≤ Tn for all tj ∈ Si (Tn is the novelty threshold).
5 PROPOSED METHODOLOGY
In this section, we describe our proposed approach to designing a microblog summarization system.

5.1 Query Formulation from Interest Profiles
Interest profiles consist of a 3-4 word title, a sentence-long description and a paragraph-length narrative explaining the detailed information need [2]. All the terms from the title field and the named entities from the description and narrative fields are extracted to generate the query. A dictionary is maintained to map named entities to their abbreviated forms.
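As an illustration, a minimal sketch of this query-formulation step is given below. It assumes spaCy for named-entity recognition and a small hand-built abbreviation dictionary; the paper does not name a specific NER tool or dictionary, so both are assumptions here.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed NER tool; any entity tagger would do

# Hypothetical dictionary mapping full named entities to abbreviated forms
ABBREVIATIONS = {"European Union": "EU", "New York City": "NYC"}

def build_query(profile):
    """Build a bag-of-words query from an interest profile with
    'title', 'description' and 'narrative' fields (Section 5.1):
    all title terms plus named entities from the other two fields."""
    terms = profile["title"].lower().split()
    for field in ("description", "narrative"):
        for ent in nlp(profile.get(field, "")).ents:
            terms.extend(ent.text.lower().split())
            if ent.text in ABBREVIATIONS:          # also add the short form
                terms.append(ABBREVIATIONS[ent.text].lower())
    return list(dict.fromkeys(terms))              # de-duplicate, keep order
```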
5.2 Tweet Pre-processing
Tweets and interest profiles were pre-processed before calculating the relevance score. Non-English tweets are filtered out using the language attribute of the tweet object, and non-ASCII characters are removed. For tweets with an external URL embedded in the text, the URL is expanded and the text of the external page is merged with the tweet text. Tweets without an external URL and with fewer than 5 tokens are filtered out.
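A minimal sketch of this filtering pipeline is shown below, assuming tweet objects parsed from the Twitter JSON stream; fetch_page_text is a hypothetical helper standing in for the URL-expansion step.

```python
import re

def fetch_page_text(url):
    """Hypothetical helper: download the page at `url` and return its text.
    A real implementation might use an HTTP client plus an HTML-to-text extractor."""
    return ""

def preprocess(tweet):
    """Clean one tweet object following Section 5.2; return the cleaned
    text, or None if the tweet should be discarded."""
    if tweet.get("lang") != "en":                      # keep English tweets only
        return None
    text = tweet.get("text", "")
    text = text.encode("ascii", "ignore").decode()     # drop non-ASCII characters
    urls = tweet.get("entities", {}).get("urls", [])
    if urls:
        # merge the text of the external page(s) with the tweet text
        text += " " + " ".join(fetch_page_text(u["expanded_url"]) for u in urls)
    elif len(text.split()) < 5:                        # no URL and fewer than 5 tokens
        return None
    return re.sub(r"\s+", " ", text).strip()
```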

5.3 Relevance Score
To retrieve relevant tweets for a given interest profile, we have implemented language models with Jelinek-Mercer and Dirichlet smoothing, with parameters λ and µ respectively. In addition, we have also used the BM25 ranking model to rank tweets. There are two types of days, namely silent days and eventful days. An eventful day is one on which there are some relevant tweets for the given interest profile. In contrast, a silent day is one for which there is no relevant tweet for the given interest profile; the system should not include any tweet in the summary for that day for that particular interest profile. On a silent day, the system receives a score of one (the highest score) if it does not include any tweet in the summary for that interest profile, and zero otherwise. Detecting a silent day for a profile is a critical task for the summarization system. The ranking function is defined as follows:

   F(IP, T) = P(IP | T, R = 1)

The above equation describes how likely the interest profile is IP given that tweet T is relevant. The term P(IP | T) is estimated by the language model.
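As a rough sketch of the scoring step, the query-likelihood scores under both smoothing schemes can be computed as below. The collection term probabilities are assumed to be precomputed from the tweet corpus, and the default λ = 0.1 and µ = 1000 follow the settings reported later in Section 6.

```python
import math
from collections import Counter

def lm_score(query_terms, tweet_terms, coll_prob, lam=0.1, mu=1000, mode="jm"):
    """Query-likelihood score of a tweet for a query.

    coll_prob[t] is the background (collection) probability of term t;
    mode 'jm' uses Jelinek-Mercer smoothing with weight lam, anything
    else uses Dirichlet smoothing with prior mu."""
    tf = Counter(tweet_terms)
    dlen = len(tweet_terms)
    score = 0.0
    for q in query_terms:
        p_bg = coll_prob.get(q, 1e-9)                  # floor for unseen terms
        if mode == "jm":
            p_doc = tf[q] / dlen if dlen else 0.0
            p = (1 - lam) * p_doc + lam * p_bg         # linear interpolation
        else:
            p = (tf[q] + mu * p_bg) / (dlen + mu)      # Dirichlet prior
        score += math.log(p)
    return score
```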
5.4 Summarization Method
To select the top relevant and novel tweets, we have designed a two-level threshold mechanism. At the first level, for any interest profile on any day, if all the tweets ranked under this profile have scores less than the silent-day threshold Ts, we consider the day a silent day and do not include any tweet in the interest profile's summary. We set Ts empirically using grid search. In the other case, where some tweet scores are greater than Ts, we normalize the tweet scores: the tweet with the highest score is assigned the value 1, and the other tweets are assigned relative values in the range 0 to 1. We include in our candidate list all tweets whose normalized score is greater than Tr2 and whose actual score is greater than Tr1, and extract the top k tweets. The second-level relevance thresholds Tr1 and Tr2 are also selected empirically using grid search.
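A compact sketch of the two-level selection is given below; the min-max normalization and the exact interplay of Tr1 and Tr2 are our reading of the description above, and the threshold values themselves are assumed to come from the grid search.

```python
def daily_summary(scored_tweets, ts, tr1, tr2, k=100):
    """Select up to k tweets for one (profile, day) pair from a list of
    (tweet, relevance_score) pairs; returns [] when the day is silent."""
    if not scored_tweets:
        return []
    scores = [s for _, s in scored_tweets]
    if max(scores) < ts:                         # level 1: silent day
        return []
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0                      # guard against identical scores
    candidates = []
    for tweet, s in scored_tweets:
        norm = (s - lo) / span                   # best tweet gets 1.0
        if s > tr1 and norm > tr2:               # level 2: raw and normalized cuts
            candidates.append((norm, tweet))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in candidates[:k]]
```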
5.4.1 Novelty Detection using Tweet Clusters. In this study, the microblog (tweet) summarization problem is treated as a tweet clustering problem. Once all the relevant tweets are retrieved, clusters are formed using the Jaccard similarity of the tweets' text. Tweets containing an external URL or a temporal expression in the text are given priority, because such tweets are more informative than tweets with only plain text and no external URL. We use regular expressions to extract temporal expressions from the tweet text.
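The clustering procedure itself is not spelled out in detail in the paper; a simple single-pass variant based on Jaccard similarity, which is one way to realize the description above, could look like the sketch below (tn stands in for the novelty threshold, and 0.6 is only a placeholder value).

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_tweets(texts, tn=0.6):
    """Greedy single-pass clustering of tweet texts: a tweet joins the
    first cluster whose representative (its earliest tweet) it overlaps
    with by more than tn, otherwise it starts a new cluster."""
    clusters = []
    for text in texts:
        tokens = set(text.lower().split())
        for cluster in clusters:
            rep = set(cluster[0].lower().split())
            if jaccard(tokens, rep) > tn:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters
```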

6 RESULTS
To evaluate the performance of the system, the normalized discounted cumulative gain nDCG@10 is computed for each day and each interest profile and is averaged across them [2]. There are two variants, nDCG@10-1 and nDCG@10-0. In nDCG@10-1 [8], on a silent day the system receives a score of 1 if it does not include any tweet in the summary for the particular interest profile, and 0 otherwise. In nDCG@10-0, for a silent day, the system receives a gain of zero irrespective of what is produced [2]. Our goal is to maximize nDCG@10-0 and nDCG@10-1 jointly, which gives a wider picture, by tuning λ and Ts in the case of the language model with JM smoothing, and µ and Ts in the case of Dirichlet smoothing.
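For concreteness, the two metric variants can be computed per (profile, day) pair roughly as follows; the gain values are those from the relevance judgments, and the only difference between the variants is the score granted on a silent day.

```python
import math

def dcg(gains):
    """Discounted cumulative gain of a ranked list of gain values."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_10(system_gains, ideal_gains, silent_day, silent_score=1.0):
    """nDCG@10 for one (profile, day) pair.  silent_score=1.0 yields the
    nDCG@10-1 variant (an empty summary on a silent day is rewarded);
    silent_score=0.0 yields nDCG@10-0."""
    if silent_day:                               # no relevant tweet exists that day
        return silent_score if not system_gains else 0.0
    if not system_gains or not any(ideal_gains):
        return 0.0
    ideal = sorted(ideal_gains, reverse=True)[:10]
    return dcg(system_gains[:10]) / dcg(ideal)
```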
   While analyzing the evaluation metrics nDCG@10-1 and nDCG@10-0 on TREC RTS 2016 [2] [8], our system failed on some of the interest profiles, such as RTS37 (Sea World), MB265 (cruise ship mishaps) and MB365 (cellphone tracking), where we could detect some of the silent days and thus obtained some score on the nDCG@10-1 metric, but did not score on the nDCG@10-0 metric. This is why we look at both metrics while evaluating our system. The TREC RTS 2016 organizers [5] used nDCG@10-1, which adds gain on silent as well as eventful days, as the primary metric to rank the participating teams. However, nDCG@10-0, which reflects how many relevant and novel tweets are part of the daily summary and does not add gain on silent days, is also very important. In our analysis of the TREC RTS 2016 results [5], it was observed that an empty run, i.e. a blank file with zero tweets, scored nDCG@10-1 = 0.2339, which is more than the average score of all the teams, so nDCG@10-1 alone is not a very accurate measure for judging systems. The COMP2016 team [4] received scores of nDCG@10-1 = 0.2898 and nDCG@10-0 = 0.0684, which shows that 76 percent of the nDCG@10-1 score obtained by that system comes from remaining silent. In this experiment, we have tried to tune parameters which maximize nDCG@10-1 and nDCG@10-0 jointly. We believe that nDCG@10-0 is a very important metric, as it indicates how many relevant and novel tweets were included in the summary. Our best result, nDCG@10-1 = 0.3524 and nDCG@10-0 = 0.1131, obtained without any sort of query expansion, substantially outperforms the top team [4] of TREC RTS 2016 [2]. The improvement in nDCG@10-0 shows that we have added more relevant tweets to the interest profile summaries, which is better in many respects.

        Table 2: Results on TREC RTS 2016 with different ranking functions using grid search

    Ranking function                            nDCG@10-1    nDCG@10-0
    Language model with JM smoothing            0.3317       0.0998
    Language model with Dirichlet smoothing     0.3384       0.1116
    Okapi BM25                                  0.3524       0.1131

        Table 3: Result comparison with the top teams of TREC RTS 2016

    Metric        Our result    COMP2016    QU        Blank run
    nDCG@10-1     0.3524        0.2898      0.2621    0.2339
    nDCG@10-0     0.1131        0.0684      0.030     0

        Table 4: Results on TREC MB 2015

    Team                                        nDCG@10
    Our result (LM with JM smoothing)           0.2676
    NUDTSNA                                     0.3670
    CLIP CMU                                    0.2492

   Table 2 shows the system results with all the standard ranking functions. The results show that all the ranking functions perform in line with each other, though the Okapi BM25 model marginally outperforms the language models. Our results with the language model with Dirichlet smoothing and with JM smoothing outperform the results reported by [6]. The factors behind this are that we have chosen the parameters λ = 0.1 and µ = 1000 and used the two-level threshold mechanism, whereas Suwaileh et al. [6] set λ = 0.7 and µ = 2000. Table 3 shows the 25 percent improvement over the results reported by the top team of TREC RTS 2016 [4] [5]. Table 4 shows the system results on the TREC MB 2015 dataset [3]; here the thresholds were decided empirically, not through grid search.

7 POST HOC ANALYSIS
In this section, we discuss a comprehensive performance analysis of the summarization system from various perspectives. Since a massive dataset is used in the experiments, tweet selection, or tweet filtering, is the primary task of the summarization system. Since Twitter restricts the length of a tweet, tweet sparseness is the biggest challenge for relevant tweet retrieval.

   Interest profiles consist of a 3-4 word title, a sentence-long description and a paragraph-length narrative explaining the detailed information need [2]. The crucial part is how we generate a query from this triplet of fields. Tan et al. [7] reported that title keywords play a critical role in retrieval. Our experiments also support these findings.

   The objective of the summarization system is to identify all the clusters formed across the given period for all the interest profiles, and not to include any tweet if the given day is silent for an interest profile. The performance of the summarization system depends upon two tasks: (i) relevant tweet retrieval and (ii) novelty detection across relevant tweets.

7.1 Interest Profile Characteristics
During the post hoc analysis, it was observed that interest profiles have different characteristics. Some interest profiles have a spatial restriction, for example bus service to NYC, gay marriage laws in Europe, or job training for high school graduates in the US. For other interest profiles such a spatial restriction does not apply, and the user information is spread across the globe, e.g. emerging music styles, adult summer camps, or hidden icons in movies and television.

   Generalized interest profiles have many silent days, whereas interest profiles with spatial named entities have more relevant tweets. Named entities play a very crucial role in relevant tweet retrieval. Some interest profile titles do not include a named entity, so we extracted named entities from the narrative field and included them in the query. Interest profiles or queries which do not have a named entity as a query term perform very badly on the result metrics, e.g. emerging music styles.

7.2 Named Entity Linking Problem
Interest profiles sometimes contain very general named entities, e.g. legalizing medical marijuana in the US, while a matching tweet contains
a more specific named entity such as Florida ("Florida Medical Association to oppose medical marijuana ballot amendment in Florida"). Due to this named entity linking problem, relevant tweets score lower against the interest profile.

7.3 Named Entity Normalization
Due to the limit on tweet length, microblog users often write named entities in abbreviated form, e.g. DEA (Drug Enforcement Agency). Even though the query contains the term drug enforcement agency, we cannot retrieve tweets that use only the abbreviated form of the normalized named entity.

7.4 Clustering Issues
Since tweet summarization is a multiple-document summarization problem, each tweet, along with its external URL, is considered one document. Since Twitter is a crowdsourcing platform, many users report the same event with different facts, so our novelty detection algorithm fails to cluster all of the following tweets into the same cluster:

T1: Woman Is Eaten Alive By A Tiger At A Safari Park
T2: Woman attacked by a tiger when she gets out of her car in a safari
T3: Horror at Beijing Safari World as tigers attack women who exited car, killing one, injuring another

7.5 Inclusion of a Conditional Event in the Interest Profile
For interest profiles like cancer and depression, our system performs very badly. Here the user is looking for patients suffering from depression after being diagnosed with cancer. It is very difficult to judge the co-occurrence of both events in a tweet.

7.6 Inclusion of Sentiment in the Interest Profile
An interest profile like Restaurant Week NYC includes sentiment, opinion or recommendation. Some tweets which match the profile but do not include the sentiment perspective are marked as non-relevant. In the future we have to use hidden features like sentiment to increase the score of such low-scoring non-relevant tweets.

7.7 Hashtag Identification
Hashtags can be one of the features for relevant tweet identification. Identifying relevant hashtags will increase the score of relevant tweets; e.g. for the keyword sea world the relevant hashtag is #seaworld, and for self driving car the relevant hashtag is #selfdrivingcar.

7.8 Effect of Query Expansion
It has been observed that for interest profiles without a proper named entity, our system performs very badly in terms of the evaluation metrics nDCG@10-1 and nDCG@10-0 in the majority of cases. We hypothesize that query expansion might work positively for these interest profiles. Our results show that query expansion for such topics improves nDCG@10-1 and nDCG@10-0. Query expansion can be applied per interest profile, on a case-by-case basis.

8 CONCLUSION
In this paper, we presented a summarization system using language models with JM smoothing and Dirichlet smoothing, and the Okapi BM25 model. The results show that all the ranking functions perform in line with each other, though the Okapi BM25 model marginally outperforms the language models. We performed a grid search to determine the optimal silent-day threshold Ts and relevance threshold Tr. We also identified the smoothing parameter λ = 0.1 for the language model with JM smoothing and, in the case of Dirichlet smoothing, µ = 1500 for better results. We showed that by effectively choosing the parameters λ and µ, we can outperform the results obtained by [6].

9 CURRENT WORK
The TREC RTS metrics give more emphasis to precision than to recall. Query expansion may include non-relevant tweets in the summary; it improves recall, but precision decreases substantially, which has an adverse effect on the results. The relevance thresholds are very critical for the summarization system when selecting tweets for the day-wise, topic-wise summary. After careful analysis on the TREC MB 2015 and TREC RTS 2016 datasets, we found that non-relevant tweets have higher scores than relevant tweets on many occasions. This gives the intuition for designing a machine learning technique or a deep neural network to estimate the silent-day threshold Ts and the relevance threshold Tr. As of now, we are working on the following hypotheses.

   H1: We can predict the thresholds for a new dataset (TREC RTS 2016) from an old dataset (TREC MB 2015).

   Some of the interest profiles are common to both datasets. Based upon this fact, we have designed the following hypothesis.

   H2: Irrespective of whether the topics are the same or different, statistical features of the rank list can be exploited to predict the silent-day threshold Ts and the relevance threshold Tr.

   As of now, we are working on a machine learning model for the estimation of these thresholds for any dataset downloaded from Twitter.
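As a first step towards H1 and H2, one could fit a regressor from simple rank-list statistics to the thresholds found by grid search on the older collection; the feature set and the choice of a random forest below are purely illustrative assumptions, not part of the system described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def ranklist_features(scores):
    """Summary statistics of one day's (non-empty) rank list; a
    hypothetical feature set for threshold prediction."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1][:100]
    return [s.max(), s.mean(), s.std(), s[:10].mean(), s.max() - s.mean()]

def fit_threshold_model(feature_rows, thresholds):
    """Fit a regressor mapping rank-list features (e.g. computed on
    TREC MB 2015) to the thresholds that maximized nDCG there, so it
    can be applied to a new collection such as TREC RTS 2016."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(np.asarray(feature_rows), np.asarray(thresholds))
    return model
```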
REFERENCES
[1] Mossaab Bagdouri and Douglas W. Oard. 2015. CLIP at TREC 2015: Microblog and LiveQA. In TREC.
[2] Jimmy Lin, Adam Roegiest, Luchen Tan, Richard McCreadie, Ellen Voorhees, and Fernando Diaz. TREC RTS 2016 Guidelines. http://trecrts.github.io
[3] Jimmy Lin, Miles Efron, Yulu Wang, Garrick Sherman, and Ellen Voorhees. TREC 2015 Microblog Track: Real-Time Filtering Task Guidelines. https://github.com/lintool/twitter-tools/wiki/TREC-2015-Track-Guidelines
[4] Haihui Tan, Dajun Luo, and Wenjie Li. PolyU at TREC 2016 Real-Time Summarization.
[5] Jimmy Lin, Adam Roegiest, Luchen Tan, Richard McCreadie, Ellen Voorhees, and Fernando Diaz. 2016. Overview of the TREC 2016 Real-Time Summarization track. In Proceedings of the 25th Text REtrieval Conference, TREC, Vol. 16.
[6] Reem Suwaileh, Maram Hasanain, and Tamer Elsayed. 2016. Light-weight, Conservative, yet Effective: Scalable Real-time Tweet Summarization. In TREC.
[7] Luchen Tan, Adam Roegiest, Charles L. A. Clarke, and Jimmy Lin. 2016. Simple dynamic emission strategies for microblog filtering. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1009–1012.
[8] Luchen Tan, Adam Roegiest, Jimmy Lin, and Charles L. A. Clarke. 2016. An exploration of evaluation metrics for mobile push notifications. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 741–744.