=Paper= {{Paper |id=Vol-1670/paper-52 |storemode=property |title=None |pdfUrl=https://ceur-ws.org/Vol-1670/paper-52.pdf |volume=Vol-1670 }} ==None== https://ceur-ws.org/Vol-1670/paper-52.pdf
        Scalable Detection of Emerging Topics and
      Geo-spatial Events in Large Textual Streams


              Erich Schubert, Michael Weiler, and Hans-Peter Kriegel


                    Institut für Informatik, LMU Munich, Germany

                     {schube,weiler,kriegel}@dbs.ifi.lmu.de

      Social media are a popular source for live textual data. This data poses
several challenges due to its size, velocity, and heterogeneity. Existing methods
for emerging topic detection often are only able to detect events of a global
magnitude such as natural disasters, or they can only monitor user-selected
keywords or a curated set of hashtags. Interesting emerging topics may, however,
be of much smaller magnitude and may involve combinations of two or more
words that are not known beforehand.
      We present several contributions introduced in previous work [1, 2]:
(i) A significance measure that can detect emerging topics early, long before
they evolve into hot tags, by drawing upon experience from outlier detection.
(ii) An efficient online algorithm to track these statistics for all words and
word-pairs with only a fixed amount of memory, and without predefined keywords.
(iii) The clustering of the detected co-trends into larger topics, because a single
event will cause multiple word combinations to trend at the same time.
(iv) How to incorporate location information into this process to both allow
reporting the locality of events as well as detecting local-only geo-textual patterns.
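
To illustrate contribution (ii), the following is a minimal sketch of how word
and word-pair counts could be tracked in a fixed amount of memory by hashing
keys into a table of constant size (a count-min style structure). The names
HashedCounter and count_document, the table size, the number of hash functions,
and the use of BLAKE2 are illustrative assumptions and are not taken from [1, 2].

```python
import hashlib
from itertools import combinations


class HashedCounter:
    """Fixed-size counter table (hypothetical helper, count-min style)."""

    def __init__(self, num_buckets=1 << 16, num_hashes=3):
        self.num_buckets = num_buckets
        self.num_hashes = num_hashes
        self.table = [0] * num_buckets  # memory use never grows past this

    def _buckets(self, key):
        # Derive several bucket indices from one key by salting the hash.
        # A set avoids updating the same bucket twice for a single key.
        indices = set()
        for seed in range(self.num_hashes):
            data = f"{seed}:{key}".encode("utf-8")
            digest = hashlib.blake2b(data, digest_size=8).digest()
            indices.add(int.from_bytes(digest, "big") % self.num_buckets)
        return indices

    def add(self, key, amount=1):
        for b in self._buckets(key):
            self.table[b] += amount

    def estimate(self, key):
        # Collisions can only inflate a bucket, so the minimum over all
        # buckets of a key is the tightest available upper bound.
        return min(self.table[b] for b in self._buckets(key))


def count_document(counter, text):
    """Count every distinct word and every unordered word pair once."""
    tokens = sorted(set(text.lower().split()))
    for word in tokens:
        counter.add(word)
    for a, b in combinations(tokens, 2):
        counter.add(f"{a}|{b}")  # sorted tokens give a canonical pair key


counter = HashedCounter(num_buckets=1024)
count_document(counter, "earthquake in nepal")
count_document(counter, "Nepal earthquake relief")
print(counter.estimate("earthquake|nepal"))  # at least 2; may be higher on collisions
```

No keyword list is required in this sketch: any word or pair can be queried,
and the price is that estimates may overcount when unrelated keys collide.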
      The significance score provides an estimated frequency and standard
deviation of words, word-pairs, and word-location information on the data
stream at minimal cost. It allows for normalization across location, culture,
and language and enables the detection of change events both in already
frequent and in previously unseen combinations. In contrast to earlier work, it
can monitor every word at every location with only a fixed amount of memory,
compare the values to statistics from earlier data, and report significant
deviations with minimal delay. The algorithm is capable of reporting breaking
news in real time as it happens in social media around the world. Location is
modeled at different granularities, such that events can be detected at a city,
country, or global level by incorporating OpenStreetMap data, or at particular
coordinates.
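
As an illustration of a score built from an estimated frequency and standard
deviation, the sketch below maintains an exponentially weighted moving average
and variance per key and reports how many standard deviations a new observation
lies above the expected value. The class name TrendStats, the learning rate
alpha, and the bias term beta are assumptions made for this example; the
abstract does not specify the exact formula.

```python
import math


class TrendStats:
    """Exponentially weighted mean and variance for one key (hypothetical)."""

    def __init__(self, alpha=0.1, beta=1e-3):
        self.alpha = alpha  # assumed learning rate: weight of the newest value
        self.beta = beta    # assumed bias term: damps scores of very rare keys
        self.mean = 0.0
        self.var = 0.0

    def score(self, x):
        # Deviation of x from the expected frequency, in standard deviations.
        expected = max(self.mean, self.beta)
        return (x - expected) / (math.sqrt(self.var) + self.beta)

    def update(self, x):
        # Standard incremental update of a weighted mean and variance.
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)


stats = TrendStats()
for _ in range(50):          # 50 time slices with a stable relative frequency
    stats.update(0.01)

print(stats.score(0.01) < 1.0)   # usual frequency: not significant
print(stats.score(0.20) > 10.0)  # sudden burst: highly significant
```

Because only two numbers are stored per key, such statistics could be kept in
the buckets of a fixed-size hash table rather than in a per-word dictionary.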



References
[1]    E. Schubert, M. Weiler, and H.-P. Kriegel. "SigniTrend: Scalable Detection
       of Emerging Topics in Textual Streams by Hashed Significance Thresholds".
       In: Proc. 20th ACM Int. Conf. on Knowledge Discovery and Data Mining
       (SIGKDD). 2014, pp. 871–880.
[2]    E. Schubert, M. Weiler, and H.-P. Kriegel. "SPOTHOT: Scalable Detection
       of Geo-spatial Events in Large Textual Streams". In: Proc. 28th Int. Conf.
       on Scientific and Statistical Database Management (SSDBM). 2016.