=Paper=
{{Paper
|id=Vol-2621/CIRCLE20_12
|storemode=property
|title=Retrospective Tweet Summarization: Investigating Neural Approaches for Tweet Retrieval
|pdfUrl=https://ceur-ws.org/Vol-2621/CIRCLE20_12.pdf
|volume=Vol-2621
|authors=Lila Boualili,Lynda Said Lhadj,Mohand Boughanem
|dblpUrl=https://dblp.org/rec/conf/circle/BoualiliLB20
}}
==Retrospective Tweet Summarization: Investigating Neural Approaches for Tweet Retrieval==
Lila Boualili (lila.boualili@irit.fr), IRIT, University of Paul Sabatier, Toulouse, France
Lynda Said Lhadj (l_said_lhadj@esi.dz), ESI, Algiers, Algeria
Mohand Boughanem (mohand.boughanem@irit.fr), IRIT, University of Paul Sabatier, Toulouse, France

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
ABSTRACT
While being a valuable source of information, Twitter can be overwhelming given the volume and the velocity of the information being published. Thus, an automatically generated summary containing relevant tweets and covering the key aspects of the user query could be of great interest. However, dealing with tweets presents challenging issues such as the redundancy of information, their limited length and their informal style. To address these issues, we follow a two-step approach. First, we retrieve the top-k relevant tweets with respect to the user topic. Deep neural (DN) models are mainly investigated for their ability to learn the relevance function from the raw text input. Meanwhile, distributed representations of tweets can reduce the semantic gap between the tweets and the topic. Then, the relevant tweets are clustered according to their similarity and the representative tweet of each cluster is included in the summary. Experiments on the TREC Real-Time Summarization (RTS) task show that DN models are promising and can even surpass the performance of a traditional IR model (BM25).

KEYWORDS
Tweet summarization, tweet retrieval, deep neural models, relevance
1 INTRODUCTION
Twitter is becoming an undeniable source of real-time information, providing in many cases the latest news, sometimes even before traditional media, especially when it comes to unpredictable events such as natural disasters. Following an ongoing event can be difficult due to the large amount of information produced with a high velocity on a wide range of topics. In 2019, 500M tweets were published every day, that is, on average, 6,000 tweets every second. Providing users with automatically generated summaries about their topics of interest is an interesting solution to prevent them from being overloaded with irrelevant and redundant tweets. However, tweet summarization requires handling the particular nature of tweets: (1) they do not necessarily use the same vocabulary as the user's interest topic expressed in a query, (2) tweets and queries are of limited length, making relevance estimation based on term frequency ineffective, and (3) they are highly redundant. Several models have been proposed to tackle the tweet summarization problem [5, 13–15, 18]. Most of these models generate summaries by retrieving the most relevant tweets and then discarding the redundant ones using classical IR models, also called bag-of-words models. Accordingly, the retrieved tweets do not always match the search intent expressed in the query, for two main reasons: the bag-of-words representation of tweets is not sufficient to capture their semantics, and term frequency is inefficient for tweets because of their limited length (280 characters), where a term rarely appears more than once.
To tackle these issues, a two-stage approach is followed. First, we investigate Deep Neural (DN) models to retrieve relevant tweets. Their interest lies, on the one hand, in their ability to learn a complex task such as ranking from raw text inputs; on the other hand, relevance in IR is known to be vague and difficult to estimate since it is the result of a complex cognitive process, which makes learning it directly from data attractive. Second, as relevant tweets are generally redundant, we address this problem by clustering the retrieved tweets so that similar ones are grouped in the same cluster and only a representative tweet per cluster is included in the summary. We focus in this paper on some existing DN models [4, 6, 9, 11] with different language representation models (Word2Vec [8], GloVe [12]) and empirically select the best performing one on the TREC Real-Time Summarization (RTS) collections. The obtained results show that estimating relevance using a DN model is promising and can even surpass the performance of a classic IR model (BM25). In addition, our results on scenario B of TREC-RTS compete with the results of the second-best run of the 2016 campaign. Our code is available for reproducing our experiments and future work at https://github.com/BOUALILILila/NeuralTweetSummarization.
The rest of the paper is organized as follows. In Section 2, we review some related work. In Section 3, we describe our two-step approach. Finally, in Section 4 we present the experimental setup and discuss our results.

2 RELATED WORK
The dominant approach for tweet summarization consists of two steps: first selecting a list of the top-k relevant tweets, and then discarding redundant ones. The first step relies on query-tweet relevance weighting, while the second uses tweet-tweet similarity measures. Sharifi et al. [13] proposed the HybridTF-IDF approach, where the overall set of tweets is considered as a single document for Term Frequency (TF) calculation, in order to overcome the tweet length problem. Top-weighted tweets are iteratively included in the summary if their cosine similarity with the tweets already selected is under an empirically predefined threshold.
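As an illustration of this selection strategy, here is a minimal sketch; the whitespace tokenization, the exact weighting and normalization, and the 0.5 similarity threshold are simplifications for illustration, not the settings of [13]:
<pre>
import math
from collections import Counter

def hybrid_tfidf_summary(tweets, k=5, sim_threshold=0.5):
    """Greedy Hybrid TF-IDF selection sketch: the whole tweet set acts as one
    document for TF; tweets are added in descending score order unless they
    are too similar to an already selected tweet."""
    toks = [t.lower().split() for t in tweets]
    tf = Counter(w for tw in toks for w in tw)        # TF over the whole tweet set
    df = Counter(w for tw in toks for w in set(tw))   # number of tweets containing w

    def weight(w):
        return tf[w] * math.log(len(tweets) / df[w])

    def score(tw):
        return sum(weight(w) for w in tw) / max(len(tw), 1)

    def cosine(a, b):
        va, vb = Counter(a), Counter(b)
        dot = sum(va[w] * vb[w] for w in va)
        norm = (math.sqrt(sum(v * v for v in va.values()))
                * math.sqrt(sum(v * v for v in vb.values())))
        return dot / norm if norm else 0.0

    ranked = sorted(range(len(tweets)), key=lambda i: score(toks[i]), reverse=True)
    summary = []
    for i in ranked:
        if all(cosine(toks[i], toks[j]) < sim_threshold for j in summary):
            summary.append(i)
            if len(summary) == k:
                break
    return [tweets[i] for i in summary]
</pre>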
Sumbasic [10], a term-frequency based method initially proposed for document summarization, has also proven to be efficient for tweet summarization. Zubiaga et al. [18] proposed one of the first summarization approaches for monitoring the live tweet stream for scheduled events.
Term frequency is used to measure the relevance of a tweet w.r.t. the event, and Kullback-Leibler divergence [7] is used to reduce redundancy.
The introduction of the TREC Real-Time Summarization (RTS) evaluation campaigns has led to the development of several models. In the best run of scenario B [15], the tweet relevance estimation is based on the query term frequency in both the tweet text and the web pages linked to the tweet. Tweet similarity is measured by their common vocabulary. The second best performing run [17] used a combination of the social importance of the tweet with its relevance score. The social importance is obtained with a logistic regression model on social attributes such as the number of followers, while the relevance is a combination of BM25, TF-IDF and cosine similarity. In TREC RTS 2017, the best run [5] evaluates tweet relevance using a language model combining the content of the tweet and of its linked web pages. The second best performing model [16] proposed to linearly combine several relevance scores such as cosine similarity, IDF weights and a negative KL-divergence language model.
Aiming at reducing the gap between the user intent expressed in a query and the tweets, our work focuses on tweet relevance estimation, where we investigate deep neural models.

3 TWEET SUMMARIZATION APPROACH
We present in this section the IR approach we follow in the context of tweet summarization. We assume that a summary is a set of the top non-redundant tweets that are relevant to the user's interest topic. First, we attempt to reduce the semantic gap between queries and tweets using distributed representations. Then, we investigate deep neural models, which are capable of learning a relevance function from the text inputs without further hand-crafted features, to retrieve candidate relevant tweets. This overcomes the ineffectiveness of term-frequency based models for short texts like tweets. As the resulting candidate list of tweets may contain redundant information, clusters of similar tweets are created using their representations in the latent space. The summary is then constituted from the representative tweet of each cluster. We detail each step of our approach below.

3.1 Text representation
The contextualization of words offered by embeddings can enrich the tweet representation and contribute to reducing the semantic gap between the query and the tweets. In this work, we investigate different distributed text representation models widely adopted in Natural Language Processing in general and in IR in particular, because of their efficiency, for example Word2Vec [8], GloVe [12] and FastText [1]. These models use shallow neural networks to learn word representations from their contexts. Specifically, Word2Vec has two configurations: SkipGram, which takes a word as input and tries to predict its context, and CBOW, which tries to predict a word from its context. FastText is a framework that learns word representations like Word2Vec but at the character n-gram level, to overcome the out-of-vocabulary problem. GloVe uses word co-occurrence statistics in the corpus in order to learn word representations from both their local and global context.
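As an illustration, the following is a minimal sketch of how such local embeddings could be pre-trained on a tokenized tweet corpus with the gensim library; the corpus file, vector size and window are illustrative assumptions, not the exact settings of our experiments:
<pre>
from gensim.models import Word2Vec, FastText

# tweets: one tokenized tweet per line, e.g. "storm hits the coast"
tweets = [line.split() for line in open("tweets.txt", encoding="utf-8")]

# Word2Vec SkipGram (sg=1) predicts the context from a word,
# while CBOW (sg=0) predicts a word from its context.
w2v_sg = Word2Vec(tweets, vector_size=300, window=5, sg=1, min_count=5)
w2v_cbow = Word2Vec(tweets, vector_size=300, window=5, sg=0, min_count=5)

# FastText learns the same objective over character n-grams (here 3-grams),
# which mitigates the out-of-vocabulary problem.
ft_cbow = FastText(tweets, vector_size=300, window=5, sg=0,
                   min_n=3, max_n=3, min_count=5)

vector = ft_cbow.wv["earthquake"]  # 300-dimensional word vector
</pre>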
3.2 Relevance estimation
The deep neural models we have investigated are MatchPyramid [11], DRMM [4], ARC-II [6] and DUET [9]. We give in the following a global overview of each model.
MatchPyramid [11] models the query-tweet matching as an image recognition process. A matching matrix representing the similarities between the query and tweet words is constructed and viewed as an image. A convolutional neural network is then used to capture rich matching patterns layer by layer. MatchPyramid can thus identify salient hierarchical matching signals, from word-level matches up to phrase-level patterns.
DRMM [4] is a relevance matching DN model for ad hoc retrieval integrating query term importance and local match signals between query and tweet words. The model uses matching histograms to capture exact match signals.
ARC-II [6] evaluates the query-tweet matching based on an interaction matrix between their words. A convolutional network with max-pooling, capable of capturing and preserving the order of local features in the interaction matrix, is then used to compute the matching score.
DUET [9] relies on both lexical and semantic matching signals to evaluate the relevance of a tweet w.r.t. a query. It uses two DN networks, one for each signal type, and the final relevance score is the sum of the two network scores.
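To make the interaction-based input of these models concrete, the following is a minimal sketch, built from pre-trained embeddings and not taken from the MatchZoo implementation we actually use, of the word-by-word similarity matrix that MatchPyramid-style models treat as an image:
<pre>
import numpy as np

def matching_matrix(query_tokens, tweet_tokens, wv):
    """Cosine-similarity matrix between query words and tweet words.

    wv maps a word to its embedding vector (e.g. gensim KeyedVectors);
    out-of-vocabulary words simply yield zero similarities.
    """
    M = np.zeros((len(query_tokens), len(tweet_tokens)))
    for i, qw in enumerate(query_tokens):
        for j, tw in enumerate(tweet_tokens):
            if qw in wv and tw in wv:
                qv, tv = wv[qw], wv[tw]
                M[i, j] = np.dot(qv, tv) / (np.linalg.norm(qv) * np.linalg.norm(tv))
    return M  # fed as an "image" to stacked convolution and pooling layers
</pre>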
At the end of this step, a list of candidate relevant tweets is ranked according to their relevance scores w.r.t. the user topic (query). As relevant information is widely shared, the top selected tweets can contain redundant information. Thus, a further step reducing the redundancy is required.
3.3 Redundancy reduction
Two tweets are considered redundant if they carry the same information. In order to discard redundant tweets, we use a clustering method to group similar tweets. To measure the similarity between tweets, we use their distributed representations, since similar words have close representations in the latent space. We obtain clusters of tweets that are equivalent in terms of information, and we select the most relevant tweet (the one with the highest relevance score) per cluster to build the summary.
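A minimal sketch of one possible implementation of this step, assuming each tweet is represented by the average of its word embeddings and that a cosine-distance threshold (the 0.3 below is an illustrative value, not a tuned parameter) delimits the clusters:
<pre>
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def summarize(tweets, scores, embed, distance_threshold=0.3):
    """Cluster candidate tweets and keep the best-scored tweet per cluster.

    tweets: list of token lists; scores: relevance scores from the retrieval step;
    embed: function mapping a token list to a dense vector (e.g. mean word embedding).
    """
    X = np.vstack([embed(t) for t in tweets])
    # scikit-learn >= 1.2 uses metric=; older versions call this parameter affinity=
    clustering = AgglomerativeClustering(
        n_clusters=None, metric="cosine", linkage="average",
        distance_threshold=distance_threshold).fit(X)

    best = {}
    for idx, label in enumerate(clustering.labels_):
        if label not in best or scores[idx] > scores[best[label]]:
            best[label] = idx  # keep the most relevant tweet of each cluster
    return [tweets[i] for i in best.values()]
</pre>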
4 EXPERIMENTS
In order to assess our assumptions, we conduct a set of experiments with the following objectives:
• Studying the impact of language representation models on DN models for tweet retrieval.
• Comparing the effectiveness of these DN models with a traditional IR baseline.
• Investigating clustering as a redundancy reduction method.
Finally, we compare our global tweet summarization method with prior work [10, 13] and with official runs of the TREC RTS 2016 and 2017 campaigns [5, 15–17].

4.1 Experimental Setup
Dataset. We used the replay mechanism of scenario B over the tweets collected during the evaluation period of the TREC RTS 2016 and 2017 campaigns. This scenario consists in identifying up to 100 ranked tweets per day and per topic, which are sent to the user daily.
The TREC RTS 2016 collection consists of 203 topics, but only 56 were assessed, along with 44,566 relevance judgements. For the 2017 dataset, 97 topics were assessed with only 39,106 relevance judgements. Each topic includes a title, a description and a narrative. In these experiments, we have used only the title, whose form is the closest to real user queries, i.e. it contains only the important keywords.
Evaluation metrics. We use standard MAP and precision at 10 (P@10) for evaluating the tweet retrieval component. We then use the official evaluation metrics of the TREC RTS campaigns, which are variants of the nDCG metric, namely nDCG0, nDCG1 and nDCGp, to evaluate the overall summarization system. The difference between these metrics resides in the penalty the system receives when it sends tweets on a "silent day" for a given topic, that is, a day on which there is actually no relevant tweet for the topic. On a silent day, nDCG0 gives a gain of 0 to a system that sends tweets; nDCG1 rewards a system that did not push any tweet with a perfect gain of 1, and gives 0 otherwise; and nDCGp, introduced in the 2017 campaign, gives a penalty that goes from 0 to 1 according to the number of tweets pushed by the system.
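The sketch below only illustrates our reading of these silent-day behaviours; it is not the official TREC RTS evaluation script, and the linear penalty assumed for nDCGp is a guess at its shape rather than the exact formula:
<pre>
def silent_day_gain(n_pushed, variant, cap=10):
    """Gain credited for one topic on a silent day, following the description above."""
    if variant == "nDCG1":
        # a perfectly silent system is rewarded with a gain of 1, any push gives 0
        return 1.0 if n_pushed == 0 else 0.0
    if variant == "nDCGp":
        # penalty assumed to grow linearly with the number of pushed tweets (@10 cap)
        return max(0.0, 1.0 - n_pushed / cap)
    # nDCG0: pushing tweets on a silent day yields no gain
    return 0.0
</pre>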
Tweet Processing. Before relevance estimation, potentially irrelevant tweets are filtered out. Each tweet with fewer than 5 tokens is considered too short to carry any information and is thus automatically discarded. This reduces the number of candidate tweets and decreases the computational complexity. We also filter out tweets that have more than one URL or more than three hashtags: in theory, hashtags are supposed to highlight the tweet's subject but, considering the tweet length, more than three of them indicates low quality rather than highlighting the key topic of the tweet.
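A minimal sketch of this filtering step (the whitespace tokenizer and the URL/hashtag detection below are illustrative simplifications):
<pre>
import re

URL_RE = re.compile(r"https?://\S+")

def keep_tweet(text, min_tokens=5, max_urls=1, max_hashtags=3):
    """Return True if the tweet passes the quality filters described above."""
    tokens = text.split()
    n_urls = len(URL_RE.findall(text))
    n_hashtags = sum(1 for t in tokens if t.startswith("#"))
    return (len(tokens) >= min_tokens
            and n_urls <= max_urls
            and n_hashtags <= max_hashtags)

tweets = ["Magnitude 6 quake hits the coast this morning #earthquake",
          "wow #a #b #c #d http://t.co/x http://t.co/y"]
candidates = [t for t in tweets if keep_tweet(t)]  # keeps only the first tweet
</pre>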
Deep Neural models. In these experiments, we rely on the open source implementations made available in MatchZoo (https://github.com/NTMC-Community/MatchZoo), a platform that aims at facilitating the design, comparison and sharing of deep text matching models. In addition, a five-fold cross-validation is used to evaluate the DN models in all experiments, in order to use all the data.
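As an illustration of this protocol, here is a minimal five-fold cross-validation sketch; splitting by topic and the train_and_score placeholder are illustrative assumptions, not the exact MatchZoo routine we use:
<pre>
import numpy as np
from sklearn.model_selection import KFold

# hypothetical topic identifiers; in our experiments these would be the RTS topics
topics = np.array(["RTS%03d" % i for i in range(56)])

def train_and_score(train_topics, test_topics):
    """Placeholder: train a matching model on train_topics, return its MAP on test_topics."""
    return 0.0

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = [train_and_score(topics[tr], topics[te]) for tr, te in kf.split(topics)]
print("mean MAP over the 5 folds:", np.mean(fold_scores))
</pre>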
4.2 Results and discussion
4.2.1 Impact of language representation models on DN models. To evaluate the effectiveness of continuous word representations for our task, we used locally pre-trained embeddings and globally pre-trained embeddings as input to the DN models MatchPyramid [11], DRMM [4], ARC-II [6] and DUET [9].
For the local embeddings, we pre-trained the two configurations of Word2Vec, (a) SkipGram (SG) and (b) CBOW, as well as FastText with 3-grams, that is (a) FastText-SG and (b) FastText-CBOW, on tweets collected before the evaluation period of the TREC RTS campaigns by [3]. Table 1 shows the statistics of this local collection. For the global embeddings, we used Word2Vec pre-trained on the Google News corpus, GloVe pre-trained on tweets and GloVe pre-trained on Wikipedia.

Table 1: Local embeddings pre-training collection.
  Tweet count    Word count     Vocabulary size (unique words)
  45,798,044     321,658,075    7,707,693

Table 2 reports the best results in terms of MAP obtained on the TREC RTS 2016 set. For our task, local embeddings seem to yield better performance across the models; the best result is achieved by MatchPyramid on local FastText-CBOW embeddings. DRMM performs poorly because it uses frequency histograms that are more effective for long documents than for short documents like tweets.

Table 2: Best MAP percentage results on the TREC RTS 2016 dataset for each model, ranked from best to worst.
  Model          Embeddings       Local   MAP (%)
  MatchPyramid   FastText-CBOW    ✓       36.13
  DUET           FastText-CBOW    ✓       29.32
  ARC-II         CBOW             ✓       28.61
  DRMM           GloVe-tweets     -       08.06

4.2.2 Contribution of DN models over a traditional IR baseline. In order to evaluate the effectiveness of DN models for tweet retrieval, we compare the MatchPyramid model with the traditional BM25 baseline. Table 3 shows the results obtained on the TREC RTS 2016 and 2017 datasets. For the 2016 collection, MatchPyramid clearly outperforms BM25 in terms of both MAP and P@10. For the 2017 collection, MatchPyramid is able to retrieve more relevant tweets than BM25; however, it has a lower P@10, indicating that it is less able to rank these tweets in the top 10.

Table 3: MatchPyramid vs. BM25.
                 TREC RTS 2016      TREC RTS 2017
  Model          MAP      P@10      MAP      P@10
  BM25           13.44    23.75     26.83    34.69
  MatchPyramid   36.13    28.93     32.20    24.00

4.2.3 Evaluation of the proposed tweet summarization method. In this experiment, we compare our overall approach with prior work on microblog summarization and with the two best official runs of TREC RTS 2016 and 2017. The main results are reported in Table 4.

Table 4: Evaluation on the TREC RTS replay mechanism of scenario B. Metrics consider only the top-10 results (@10).
                       TREC RTS 2016       TREC RTS 2017
  Model                nDCG0    nDCG1      nDCG1    nDCGp
  Ours                 07.13    27.13      13.18    24.26
  TF-IDF [2]           08.34    17.45      25.70    31.66
  HybridTF-IDF [13]    07.67    16.78      24.97    30.95
  Sumbasic [10]        05.36    16.55      24.24    30.22
  PolyURunB3 [15]      06.84    28.98      -        -
  nudtsna [17]         05.29    27.08      -        -
  HLJIT_qFB_url [5]    -        -          29.10    36.56
  PKUICSTRunB1 [16]    -        -          30.03    34.83

TREC RTS 2016. The results show that our approach outperforms the second best TREC RTS 2016 submission with an improvement of more than 34.78% in terms of nDCG0@10. We also notice an improvement of 4.5% in terms of nDCG0@10 over the best TREC RTS 2016 submission. We recall that the nDCG1@10 measure rewards, with a gain of 1, any system that does not push tweets on a silent day. The drawback of this measure is that a system can have a non-zero score for an empty submission (no results returned on any given day). Indeed, it is difficult for a system to obtain a high gain (close to 1), whereas an empty submission easily obtains a gain of 1 on silent days, knowing that 30.89% of the days of the RTS 2016 campaign are silent.
TREC RTS 2017. We notice that our approach does not perform well on this collection. This can be explained by two reasons.
The first one concerns the retrieval of candidate relevant tweets: we think that the relevance estimation using MatchPyramid was not able to outperform BM25 in terms of P@10, and with a poor candidate list of tweets it is difficult to construct a good summary. The second reason may be related to the silent days issue, which we did not handle in this work.
5 CONCLUSION
In this work, we have presented an experimental study in which we consider tweet summarization as a tweet retrieval task. To overcome the issue of semantic matching between user queries and tweets, we have investigated the impact of distributed representations and DN models on the retrieval task. The analysis of the obtained results has shown that no single representation model was best for all the evaluated DN models on the TREC RTS task. However, the MatchPyramid model yields the best results, with an optimum on embeddings locally pre-trained with the FastText model. In addition, we have shown that this configuration outperforms BM25. We conclude that neural models can effectively learn to retrieve relevant tweets, although they remain weaker at ranking them, which is challenging. Since the performance of DN models depends highly on the amount of training data, in future work we plan to investigate weak supervision in order to generate training data at low cost.
REFERENCES
[1] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. CoRR abs/1607.04606 (2016). arXiv:1607.04606 http://arxiv.org/abs/1607.04606
[2] Deepayan Chakrabarti and Kunal Punera. 2011. Event Summarization Using Tweets. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media.
[3] Abdelhamid Chellal. 2018. Event Summarization on Social Media Stream: Retrospective and Prospective Tweet Summarization. Thèse de doctorat. Université Paul Sabatier, Toulouse, France. https://www.irit.fr/publis/IRIS/2018_These-CHELLAL.pdf
[4] Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. 55–64.
[5] Zhongyuan Han, Song Li, Leilei Kong, Liuyang Tian, and Haoliang Qi. 2017. HLJIT at TREC 2017 Real-Time Summarization. In Proceedings of The Twenty-Sixth Text REtrieval Conference, TREC 2017, Gaithersburg, Maryland, USA, November 15-17, 2017, Ellen M. Voorhees and Angela Ellis (Eds.), Vol. Special Publication 500-324. National Institute of Standards and Technology (NIST).
[6] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems. 2042–2050.
[7] Solomon Kullback and Richard A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22, 1 (1951), 79–86.
[8] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[9] Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web. 1291–1299.
[10] Ani Nenkova and Lucy Vanderwende. 2005. The impact of frequency on summarization. Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005-101 (2005).
[11] Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text matching as image recognition. In Thirtieth AAAI Conference on Artificial Intelligence.
[12] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. http://www.aclweb.org/anthology/D14-1162
[13] Beaux Sharifi, Mark-Anthony Hutton, and Jugal K. Kalita. 2010. Experiments in microblog summarization. In 2010 IEEE Second International Conference on Social Computing. IEEE, 49–56.
[14] Lidan Shou, Zhenhua Wang, Ke Chen, and Gang Chen. 2013. Sumblr: continuous summarization of evolving tweet streams. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. 533–542.
[15] Haihui Tan, Dajun Luo, and Wenjie Li. 2016. PolyU at TREC 2016 Real-Time Summarization. In Proceedings of The Twenty-Fifth Text REtrieval Conference, TREC 2016, Gaithersburg, Maryland, USA, November 15-18, 2016, Ellen M. Voorhees and Angela Ellis (Eds.), Vol. Special Publication 500-321. National Institute of Standards and Technology (NIST).
[16] Jizhi Tang, Chao Lv, Lili Yao, and Dongyan Zhao. 2017. PKUICST at TREC 2017 Real-Time Summarization Track: Push Notifications and Email Digest. In Proceedings of The Twenty-Sixth Text REtrieval Conference, TREC 2017, Gaithersburg, Maryland, USA, November 15-17, 2017, Ellen M. Voorhees and Angela Ellis (Eds.), Vol. Special Publication 500-324. National Institute of Standards and Technology (NIST).
[17] Xiang Zhu, Jiuming Huang, Sheng Zhu, Ming Chen, Chenlu Zhang, Zhenzhen Li, Huang Dongchuan, Zhao Chengliang, Aiping Li, and Yan Jia. 2015. NUDTSNA at TREC 2015 Microblog Track: A Live Retrieval System Framework for Social Network based on Semantic Expansion and Quality Model. In Proceedings of The Twenty-Fourth Text REtrieval Conference, TREC 2015, Gaithersburg, Maryland, USA, November 17-20, 2015, Ellen M. Voorhees and Angela Ellis (Eds.), Vol. Special Publication 500-319. National Institute of Standards and Technology (NIST).
[18] Arkaitz Zubiaga, Damiano Spina, Enrique Amigó, and Julio Gonzalo. 2012. Towards real-time summarization of scheduled events from twitter streams. In Proceedings of the 23rd ACM Conference on Hypertext and Social Media. 319–320.