=Paper=
{{Paper
|id=Vol-2078/paper2
|storemode=property
|title=Real-Time Collection of Reliable and Representative Tweets Datasets Related to News Events
|pdfUrl=https://ceur-ws.org/Vol-2078/paper2.pdf
|volume=Vol-2078
|authors=Béatrice Mazoyer,Julia Cagé,Céline Hudelot,Marie-Luce Viaud
|dblpUrl=https://dblp.org/rec/conf/ecir/MazoyerCHV18
}}
==Real-Time Collection of Reliable and Representative Tweets Datasets Related to News Events==
Béatrice Mazoyer¹, Julia Cagé², Céline Hudelot³, and Marie-Luce Viaud¹

¹ Institut National de l'Audiovisuel, Bry-sur-Marne, France (bmazoyer@ina.fr, mlviaud@ina.fr)
² Sciences Po Paris, Paris, France (julia.cage@sciencespo.fr)
³ CentraleSupélec, Mathematics Interacting with Computer Science laboratory, Gif-sur-Yvette, France (celine.hudelot@centralesupelec.fr)

Abstract. This paper is part of a wider work studying the co-influences of Twitter and the production of information by traditional media. A strong prerequisite of this study is to collect, within the limitations of the Twitter API, tweets linked to media events that are representative of the real Twitter activity. This paper describes two proposed approaches to this important task. The first, inspired by information retrieval, puts the focus on query formulation: it consists in bridging the vocabulary gap between traditional news articles and tweets by iteratively modifying the queries sent to the Twitter API depending on the tweets retrieved by previous queries. The second approach consists in streaming a representative sample of all emitted tweets and dynamically clustering them into events. We also discuss approaches to evaluate the collected datasets from the point of view of their representativity of the real activity on Twitter.

Keywords: Twitter, news, tweets retrieval

1 Introduction

How does information propagate online? Does it propagate differently on news websites and on social media? What is the role of social networks, and in particular of Twitter, in breaking news?

In an ideal world, to compare news production on social media and on mainstream media, one would need the universe of documents published during a given period of time (e.g. the year 2017) and in a given geographical location (e.g. France, the UK or the US), on the one hand on social media and on the other hand on mainstream media.
BroDyn 2018: 1st Workshop on Analysis of Broad Dynamic Topics over Social Media @ ECIR 2018, Grenoble, France

Unfortunately, given the limitations of the Twitter API, it is not possible for the researcher to capture the universe of the documents (or tweets) published on Twitter. Does it mean that the researcher won't be able to answer the previously formulated questions? No, because the researcher can instead use a random sample of the documents, as long as this sample is representative.

Why do we need representativity? Assume that you want to answer the following question: what is the probability for a news story broken on social media to make it to the mainstream media? With the entire set of news stories broken on social media, it would be pretty simple to answer this question; but we don't have this data. Now assume that we get access to a subsample of the documents published on Twitter, but that this sample is not representative. For example, assume that the characteristics of the sampled tweets (perhaps because the API provides the researcher with documents tweeted by users with more followers) are such that these documents have a higher probability to make it to the mainstream media. Then using this biased subsample will lead the researcher to overestimate the probability for a news story broken on social media to appear on mainstream media.

The same issue arises if the researcher wants to tackle the follow-up question: what are the determinants of the success of a news story initially broken on social media? Imagine that the researcher is using a selected sample of tweets that is not representative. If, for example, this sample of tweets comes mainly from journalists working for a given media outlet, e.g. Le Monde, and if, at the same time, within the set of tweets posted by Le Monde's journalists, only the successful ones are part of the sample, then the results of the empirical analysis will be biased in favor of Le Monde.
In other words, when the researcher studies the impact of the company for which the journalist works (independent variable) on the probability for the news story broken on Twitter to make it to mainstream media (dependent variable), the coefficient obtained for Le Monde will overestimate the real causal impact of the company. It seems very difficult to correct for this bias. Hence the necessity of having a representative sample of tweets, i.e. a sample such that the tweets included do not differ from the tweets that are not included along any of the dimensions that may have a direct impact on the dependent variable of interest.

In this paper, we aim at collecting, in real time, tweets linked to French news that are representative of the total Twitter activity concerning media events. For now, we only work on tweets in French, which is an advantage in terms of volume: since the Twitter API puts high restrictions on the volume of tweets returned, we get a larger proportion of emitted tweets by working on a language that is not heavily represented on Twitter (French tweets represent 1.8% of all emitted tweets).

Based on the different ways of collecting tweets with the Twitter API, we explore two types of approaches to collect tweets:

1. Vocabulary-constrained collection based on keywords extracted from AFP dispatches. We use the stream of dispatches from the French news agency Agence France-Presse (AFP). This news agency, like Reuters or the Associated Press (AP), has a large network of journalists in many countries that covers a wide range of news topics. Previous research has shown that this source covered 95% of all events covered by French news media in 2013 [1].
2. Random tweets collection followed by clustering. To get a continuous stream of random tweets, we studied the Twitter Sample API⁴. Indeed, previous work shows that the tweets from that API are "a representative sample of the true activity on Twitter" [2].
However, the number of French tweets in this sample is very small (2,400 tweets per hour on average), and it is likely that small events do not generate enough French tweets to be represented in that sample. We thus describe, in Section 3.3, an approach to increase this sample of French tweets.

2 State of the art

Collecting representative sets of tweets requires performing both the tweets collection itself and the evaluation of its representativity. We therefore present in this part: (1) techniques to build sets of relevant tweets and (2) methods to evaluate sets of tweets. A tool designed to collect representative tweets sets should comprise both aspects.

2.1 Query building strategies

To our knowledge, few approaches have been proposed in the literature to perform targeted tweet collection based on news articles. Indeed, the literature related to the task of linking tweets and news [3–8] mainly used existing datasets, mostly in English; these articles do not address the issue of real-time collection of representative tweets. Becker et al. [9] do not directly work on news articles, but they define query building strategies based on the title, description, time and location of a selected set of expected events to collect related tweets. Tanev et al. [10] extract 1-grams and 2-grams from the title and first sentence of news articles, and weight them depending on their IDF in a collection of one million news articles. They build queries out of these n-grams and explore several query-expansion strategies based on co-occurrences in their news articles corpus. Ning et al. [11] build chains of articles concerning the same event and first collect tweets containing the articles' urls, then extract a list of the top ten keywords from those tweets and collect tweets containing these keywords. This method is an attempt to bridge the vocabularies of tweets and news by using keywords from the tweets. Castillo et al.
[12], who share common objectives with our work, also use urls to retrieve tweets related to news articles. However, our observations show that most tweets containing a link to an article share only the title or first sentence of that article, without additional vocabulary. This approach is restrictive and may cause us to miss a large part of the Twitter activity concerning news. Our work aims at building queries that contain Twitter-specific terms in order to achieve higher recall than previously introduced methods, without losing precision.

⁴ https://developer.twitter.com/en/docs/tweets/sample-realtime/api-reference/get-statuses-sample

2.2 Representativity of tweets collection

Few papers address the problem of the representativity of the collected tweets: we could only find three studies addressing this issue [13, 2, 14]. Morstatter et al. [13] and Wang et al. [14] propose methods to evaluate the Twitter Filter API⁵, which returns a stream of tweets matching one or several predicates given as parameters. Both articles point out biases in the collected datasets. In another article [2], Morstatter et al. evaluate the Sample API and conclude that it is representative of the global Twitter activity. However, this study does not consider methods to collect larger representative datasets in a specific language, as we do for French in this paper.

3 Methodology

Fig. 1. Diagram of our keywords mining approach on three recent events

We present here two approaches for tweet collection:

– A tool designed to build queries in real time based on the stream of AFP dispatches and expand them using Twitter-specific vocabulary. This approach is novel with respect to existing works [9–11], since the step of query expansion using word embeddings, while developed in other contexts such as web retrieval [15, 16], has, as far as we know, never been used in the context of tweets retrieval.
– A method to get a significant and representative set of random tweets in French.
Our work here aims at maximizing the volume of our random sample while keeping its distribution characteristics as close as possible to those of the random sample collected with the Twitter Sample API. Events will then be detected by clustering, following the same procedure as [17].

⁵ https://developer.twitter.com/en/docs/tweets/filter-realtime/api-reference/post-statuses-filter

3.1 Vocabulary-constrained collection

Our query building tool is composed of several modules designed to (1) extract relevant keywords from AFP dispatches and build queries with them, (2) expand those queries using Twitter vocabulary, and (3) validate that these queries return tweets linked to the event. All steps are summarized in Figure 1.

AFP dispatches comprise a title, a timestamp and a body containing a few lines of text describing a news event. 900 dispatches are published every day on average. We define a news event very simply as a new AFP dispatch whose title has fewer than k words in common with the titles of previous dispatches published in a time window T. We do not perform more complex dispatch clustering, since our observations show that when AFP journalists publish several dispatches on a given event, they usually keep the same title and only update the main text with additional details. In practice, we have set k to 3 and T to 24 hours. For each event, we aim at building a set of queries that will retrieve tweets related to that event.

Keywords extraction. For a given dispatch, we extract named entities from the title and the body. We then build queries using combinations of those entities, following a number of rules: two locations cannot form a query (to avoid queries like "Alep Syria"), and a query must contain at least two and fewer than four entities. This method is quite similar to what is done by Becker et al. [9]. We also take the exact title of the dispatch as a query if it contains no named entities. The tweets containing these queries are then collected.
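The entity-combination step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the `("text", "type")` pair representation and the `"LOC"` label for location entities are our assumptions.

```python
from itertools import combinations

def build_queries(entities):
    """Build candidate queries from the named entities of a dispatch.

    `entities` is a list of (text, type) pairs, e.g. [("Macron", "PER"), ("Alep", "LOC")].
    Rules from the paper: a query combines at least two and fewer than four
    entities, and a combination made only of locations is rejected
    (to avoid queries like "Alep Syria").
    """
    queries = []
    for size in (2, 3):  # at least two, fewer than four entities
        for combo in combinations(entities, size):
            # reject combinations consisting only of location entities
            if all(etype == "LOC" for _, etype in combo):
                continue
            queries.append(" ".join(text for text, _ in combo))
    return queries
```

For a dispatch mentioning "Macron", "Alep" and "Syria", this yields the queries "Macron Alep", "Macron Syria" and "Macron Alep Syria", while the all-location pair "Alep Syria" is filtered out.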
Query expansion. We combine several methods to build new queries from collected tweets:

– Urls extraction: like Ning et al. [11], we extract urls from collected tweets and extend our query set with these urls.
– Keywords detection: we build a TF-IDF matrix from all terms of the tweets collected in the past 7 days, where all tweets related to the same event form a document. We pick terms with a TF-IDF score higher than a defined threshold (different for 1-grams, 2-grams and 3-grams).
– Synonyms: we use the CBOW model [18] with negative sampling [19] to build vector representations of words in the Twitter vocabulary and find potential "synonyms" (terms used in very similar contexts) to expand our first queries. We train the model on a sample of French tweets collected with the Sample API over the past 7 days to get a vector representation of Twitter words. We define as "synonyms" of a word the words whose vectors have a cosine similarity with that word's vector higher than a certain threshold. For instance, using 0.85 as a threshold, we get "fh" (abbreviation of "François Hollande") as a synonym for "hollande". We then build new queries by replacing terms of previous queries with their synonyms, when they have any.

Query validation. The previous steps allow us to combine a large number of words to build queries that might be related to the given event. However, these methods can also produce wrong queries, either not precise enough ("Elysee Macron" for example, since the "Elysée" is the common name for the French president's office), or clearly pointing to something or someone other than the entities extracted from the original dispatch. For instance, the synonyms of "sarkozy" are "sarko" (which is correct), but also "juppe" and "fillon" (politicians from the same party as Nicolas Sarkozy, but not the same person). We therefore need a method to filter out incorrect queries.
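The synonym step described above can be sketched as follows, assuming word vectors have already been trained (with CBOW and negative sampling) on a week of sampled French tweets. The dictionary-of-vectors representation and the function names are illustrative, not the paper's code:

```python
import numpy as np

def synonyms(word, vectors, threshold=0.85):
    """Return the words whose vector has cosine similarity >= threshold
    with the vector of `word`. `vectors` maps each vocabulary word to a
    NumPy array (e.g. a CBOW embedding trained on sampled French tweets)."""
    v = vectors[word]
    out = []
    for other, u in vectors.items():
        if other == word:
            continue
        sim = np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u))
        if sim >= threshold:
            out.append(other)
    return out

def expand_queries(queries, vectors, threshold=0.85):
    """Derive new queries by substituting each term of each query
    with each of its synonyms, one substitution at a time."""
    new_queries = []
    for q in queries:
        terms = q.split()
        for i, term in enumerate(terms):
            for syn in (synonyms(term, vectors, threshold) if term in vectors else []):
                new_queries.append(" ".join(terms[:i] + [syn] + terms[i + 1:]))
    return new_queries
```

With embeddings where "fh" lies close to "hollande", a query such as "hollande macron" would thus spawn the additional query "fh macron", which then has to survive the validation step below.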
We make the following hypothesis: if a news event is discussed on Twitter, the tweets containing words related to that event should form a peak in a time window of a few hours before or after the time of the event. If the tweets collected through a certain query do not peak around the event's time, it is either because the news is not discussed on Twitter, or because the query is not linked to the event in question. Starting from this assumption, we are working on an algorithm to detect such anomalies in the time series of tweets and thus remove irrelevant queries.

Iteration. The keywords extraction and query validation steps are repeated to increase the number of tweets linked to the considered event, until the set of collected tweets provides no new queries.

3.2 Evaluation of the tweets collection tool

To the best of our knowledge, there is no dataset corresponding to our task. We are currently designing a user interface to manually label tweets. This interface will allow two types of evaluations:

– Precision: given the text of an AFP dispatch and the list of tweets collected by our tool as presumably linked to the dispatch, the user will assess whether each tweet is correctly attributed.
– Recall: the user has to manually formulate queries to retrieve as many tweets as possible related to a given AFP dispatch, and assess whether the returned tweets are indeed linked to the dispatch. This method, however, is only a partial solution to the problem of evaluating the proportion of tweets not found by our tool, since the user will not necessarily find all possible terms to query relevant tweets. Another way of approximating the recall would be to compare the proportion of tweets in each collected set to the size of the clusters identified with our second approach.

3.3 Random tweets collection

The Twitter Sample API returns 1% of the world stream, randomly selected [2], among which 1.8% on average are in French.
1% of the French stream represents on average only 2,400 tweets an hour. A previous article [17] estimated the proportion of news-related documents in tweets corpora at less than 0.2%. If this ratio is the same for French tweets, it means that we would receive between 4 and 5 news-related tweets in French per hour using the Sample API, a volume much too small for our goals.

Using the Filter API on "neutral" terms with "French" as language parameter allows us to collect a larger volume of tweets in French, but Twitter does not document how these tweets are sampled. Since we need to ensure that the stream of collected tweets is representative of the tweets emitted on Twitter at the same moment, we designed methods to compare it with the stream from the Sample API. Joseph et al. [20] compare several samples collected with the Filter API on the same keywords using different connection tokens⁶: they find that two samples taken at the same time with the same keywords as inputs are "nearly identical" (96% of the tweets are the same, and the sets of tweets that differ have a very similar structure in terms of number of hashtags, urls and mentions per tweet). It is thus not possible to use several connections with the same keywords to get a higher number of tweets. However, spreading different keywords over several API connections should return a higher number of tweets. If these tweets are representative of the real activity on Twitter, the distribution of words within them should be (proportionally) the same as that of the tweets from the Sample API. To ensure that this is the case, we worked on optimizing the collection parameters (number of connections, number of keywords, distribution of keywords over the different tokens) by performing the steps described below.

Finding the most frequent keywords on Twitter. We collected tweets from the Sample API during three months and selected the French tweets (retweets excluded).
After removing punctuation, capital letters and non-Latin characters, we built a list of the 200 most frequent words in the French Twitter vocabulary.

Clustering keywords depending on their co-occurrences. We built a matrix of co-occurrences of the 50, 100 and 200 most frequent words in the sampled tweets, and used it to cluster the keywords in n clusters (n ∈ [2, 4]). By doing so, we aimed at putting together terms that are frequently used together, in order to collect sets of tweets with the smallest possible intersection. In total, we had 9 ways of collecting tweets (50, 100 or 200 words, spread over 2, 3 or 4 clusters). To check that our clustering approach was indeed the best, we also tried to randomly spread the same number of words (50, 100 or 200) over the same number of Twitter access tokens (2, 3 or 4). We ran each test during 24 hours and compared the returned tweets with those obtained using the Sample API during the same period of 24 hours.

⁶ To use the Twitter API, a connection token is required. Twitter limits access to its data by generating only one connection token per Twitter account. The total number of tweets that one can get with a single token is limited to 1% of the global tweets volume at a given moment.

Running tests simultaneously. In order to select the best collection method, we had to run series of tests in parallel: it is not possible to compare tests conducted on different periods of time, since differences between the results could be due to a different tweets distribution between periods. However, we only had 13 Twitter access tokens, and thus could not run all tests simultaneously. Since a collection method requires 2 to 4 tokens, we could only run 3 to 6 tests in parallel.

3.4 Evaluation of random tweets collection

We used the Kullback-Leibler divergence to compare the word distributions of each set of collected tweets to the distribution from the Sample API.
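This comparison can be sketched as follows. Two details are our assumptions, since the paper does not specify them: the direction of the divergence (Sample API word distribution as P) and the additive smoothing used to keep unseen words from producing an infinite divergence.

```python
import math

def kl_divergence(counts_p, counts_q, alpha=1.0):
    """KL divergence D(P || Q) between two word-count dictionaries,
    with additive (Laplace) smoothing over the joint vocabulary.
    counts_p: word counts from the Sample API stream (assumed reference).
    counts_q: word counts from a candidate collection method."""
    vocab = set(counts_p) | set(counts_q)
    total_p = sum(counts_p.values()) + alpha * len(vocab)
    total_q = sum(counts_q.values()) + alpha * len(vocab)
    div = 0.0
    for w in vocab:
        p = (counts_p.get(w, 0) + alpha) / total_p
        q = (counts_q.get(w, 0) + alpha) / total_q
        div += p * math.log(p / q)
    return div
```

A collection method whose word distribution matches the Sample API yields a divergence near zero; the larger the divergence, the less representative the collected stream.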
The best collection method should have a word distribution very similar to the distribution of words collected with the Sample API, and thus the Kullback-Leibler divergence between the two distributions should be close to zero. We also evaluated our collection methods with regard to the volume of collected tweets: we used a simple ratio between the volume collected with each method and the volume collected using the Sample API during the same period.

4 Results

Fig. 2. Daily evolution of the KL divergence between each collection method and the Sample API

Fig. 3. Daily evolution of the ratio between the volume of tweets collected with each collection method and the volume collected with the Sample API. Note: we have no results on 01/22, since the connection to the API was broken.

4.1 Tweets collection tool

As explained in Section 3.2, we lack a labeled test set to perform a proper evaluation of our tweets retrieval method. Our work so far has been directed towards the implementation of a robust architecture to stream and analyze a high number of documents in near real time. The development of our test interface will allow us to provide concrete results in the near future. We also plan to compare our strategy with the state of the art presented in Section 2.1.

4.2 Random tweets collection

Our results are displayed in Figures 2 and 3. We get consistent results over time: the collection method using words clustered by co-occurrences performs better, in terms of proximity to the Sample, than the methods using a random distribution of words. The volume of collected tweets greatly differs from one collection method to another (from 50 to 75 times higher than the Sample), but the trend also remains stable over time⁷.
However, these results need to be reproduced over a longer period to be confirmed, and we need to conduct wider tests on all collection methods.

⁷ The drop in KL divergence and in the volume of collected tweets on 01/21 could be linked to the fact that it was a Sunday, but we need to reproduce the experiment over a longer period to confirm that this drop is not a coincidence.

5 Related work

5.1 Detecting events in the Twitter stream

A large number of methods have been proposed for detecting "stories" [21] or "events" [22–26] in a continuous stream of tweets. However, these methods do not link the detected clusters to news articles. Sankaranarayanan et al. [27] use handpicked "seeders" (Twitter accounts that are known to publish news) and train a naive Bayes classifier to separate news tweets from "junk" tweets; news tweets are then clustered based on TF-IDF. Liu et al. [17] also adopt a two-step approach to detect news clusters in a continuous stream of tweets. First, they use a noise-filtering algorithm based both on a set of filtering rules and on information credibility features [28]. Then, they perform a clustering step based on a similarity metric combining named entities, verbs and common nouns. These methods do not address the problem of linking the collected tweets to news articles, but they are interesting as a first selection of relevant tweets.

5.2 Topic models

Many methods designed to link tweets and news are based on topic models. A topic model takes a collection of documents as input and returns a set of topics and their distribution within the collection (a document is considered as a mix of several topics). Latent Dirichlet Allocation (LDA) [29] is currently the most commonly used topic model. Zhao et al. [4] perform LDA on a collection of news articles and an LDA-like algorithm on a collection of tweets, and match the two sets of topics based on the similarity of their word distributions.
Other works develop algorithms to jointly learn the topics of tweets and news datasets [7, 8]. Guo et al. [5] design a more specific latent variable model, including features like hashtags, named entities and temporal relations, to link each tweet of their dataset to the closest news article. These works perform well on aligning tweets to related news articles, but they are based on static datasets and do not tackle the dynamic nature of news events. In 2006, Blei and Lafferty proposed the Dynamic Topic Model (DTM) [30], a model able to discover the evolution of topics over time. Mele et al. adapted this model to news events [6] and used clustering to link the documents covering the same event across several news streams (news websites, RSS feeds, Twitter accounts of media outlets).

Overall, several works address the task of linking tweets and news, but they rely either on static datasets or on streams of tweets from manually selected users. We could not find any approach questioning the representativity of the collected tweets. Moreover, most approaches treat tweets and news articles only as text documents and do not take their multimodal nature (urls, mentions of users, pictures, videos) into account.

6 Conclusion and future work

In this paper, we present two types of methods to collect tweets related to news events. The first is a query formulation approach, which consists in generating potential queries related to an event and using the temporal distribution of the collected tweets to detect and reject wrong queries. The second is an approach based on clustering randomly collected tweets into events. We detailed the tests used to ensure that the collected tweets have a distribution close to that of a random sample, and showed from our first results that neutral words clustered by co-occurrence tend to perform better. In future work, we plan to develop a user interface that will enable the labeling of numerous tweets and give us a way of assessing the performance of our search tool.
We will also work on methods to dynamically link tweets to news events, with a focus on processing tweets not only as text documents, but rather as multimodal objects.

References

1. Cagé, J., Hervé, N., Viaud, M.L.: The production of information in an online world: Is copy right? CEPR Discussion Paper #12066 (2017)
2. Morstatter, F., Pfeffer, J., Liu, H.: When is it biased? Assessing the representativeness of Twitter's streaming API. In: WWW, Companion Volume (2014) 555–556
3. Kothari, A., Magdy, W., Darwish, K., Mourad, A., Taei, A.: Detecting Comments on News Articles in Microblogs. In: ICWSM (2013)
4. Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.P., Yan, H., Li, X.: Comparing Twitter and Traditional Media Using Topic Models. In: ECIR (2011) 338–349
5. Guo, W., Hao, L., Heng, J., Diab, M.: Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media. In: ACL (2013) 239–249
6. Mele, I., Bahrainian, S.A., Crestani, F.: Linking News across Multiple Streams for Timeliness Analysis. In: CIKM (2017) 767–776
7. Hu, Y., John, A., Wang, F., Kambhampati, S.: ET-LDA: Joint Topic Modeling for Aligning Events and their Twitter Feedback. (2012)
8. Hua, T., Ning, Y., Chen, F., Lu, C.T., Ramakrishnan, N.: Topical Analysis of Interactions Between News and Social Media. In: AAAI (2016) 2964–2971
9. Becker, H., Chen, F., Iter, D., Naaman, M., Gravano, L.: Automatic Identification and Presentation of Twitter Content for Planned Events. In: ICWSM (2011)
10. Tanev, H., Ehrmann, M., Piskorski, J., Zavarella, V.: Enhancing Event Descriptions through Twitter Mining. In: ICWSM (2012)
11. Ning, Y., Muthiah, S., Tandon, R., Ramakrishnan, N.: Uncovering News-Twitter Reciprocity via Interaction Patterns. In: ASONAM (2015) 1–8
12. Castillo, C., El-Haddad, M., Pfeffer, J., Stempeck, M.: Characterizing the life cycle of online news stories using social media reactions. In: CSCW (2014) 211–223
13.
Morstatter, F., Pfeffer, J., Liu, H., Carley, K.M.: Is the Sample Good Enough? Comparing Data from Twitter's Streaming API with Twitter's Firehose. In: ICWSM (2013)
14. Wang, Y., Callan, J., Zheng, B.: Should we use the sample? Analyzing datasets sampled from Twitter's stream API. ACM Trans. Web 9(3) (June 2015) 13:1–13:23
15. Diaz, F., Mitra, B., Craswell, N.: Query Expansion with Locally-Trained Word Embeddings. CoRR abs/1605.07891 (May 2016)
16. Roy, D., Paul, D., Mitra, M., Garain, U.: Using Word Embeddings for Automatic Query Expansion. CoRR abs/1606.07608 (June 2016)
17. Liu, X., Li, Q., Nourbakhsh, A., Fang, R., Thomas, M., Anderson, K., Kociuba, R., Vedder, M., Pomerville, S., Wudali, R., Martin, R., Duprey, J., Vachher, A., Keenan, W., Shah, S.: Reuters Tracer: A Large Scale System of Detecting & Verifying Real-Time News Events from Twitter. In: CIKM (2016) 207–216
18. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013)
19. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS (2013) 3111–3119
20. Joseph, K., Landwehr, P.M., Carley, K.M.: Two 1%s Don't Make a Whole: Comparing Simultaneous Samples from Twitter's Streaming API. In: SBP (2014) 75–83
21. Petrović, S., Osborne, M., Lavrenko, V.: Streaming first story detection with application to Twitter. In: HLT-NAACL (2010) 181–189
22. Atefeh, F., Khreich, W.: A Survey of Techniques for Event Detection in Twitter. Computational Intelligence 31(1) (February 2015) 132–164
23. Aggarwal, C.C., Subbian, K.: Event detection in social streams. In: SIAM (2012) 624–635
24. Weng, J., Lee, B.S.: Event Detection in Twitter. In: ICWSM (2011)
25. Liang, S., Yilmaz, E., Kanoulas, E.: Dynamic Clustering of Streaming Short Documents. In: KDD (2016) 995–1004
26.
Zhang, C., Liu, L., Lei, D., Yuan, Q., Zhuang, H., Hanratty, T., Han, J.: TrioVecEvent: Embedding-Based Online Local Event Detection in Geo-Tagged Tweet Streams. In: KDD (2017) 595–604
27. Sankaranarayanan, J., Samet, H., Teitler, B.E., Lieberman, M.D., Sperling, J.: TwitterStand: news in tweets. In: GIS (2009) 42–51
28. Castillo, C., Mendoza, M., Poblete, B.: Information credibility on Twitter. In: WWW (2011) 675–684
29. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003) 993–1022
30. Blei, D.M., Lafferty, J.D.: Dynamic topic models. In: ICML (2006) 113–120