=Paper=
{{Paper
|id=Vol-2277/paper36
|storemode=property
|title=
Discovering Novel Emergency Events in Text Streams

|pdfUrl=https://ceur-ws.org/Vol-2277/paper36.pdf
|volume=Vol-2277
|authors=Artem Shelmanov,Dmitriy Deviatkin,Daniil Larionov
|dblpUrl=https://dblp.org/rec/conf/rcdl/ShelmanovDL18
}}
==
Discovering Novel Emergency Events in Text Streams
==
<pdf width="1500px">https://ceur-ws.org/Vol-2277/paper36.pdf</pdf>
<pre>
           Discovering Novel Emergency Events in Text Streams
                 © Dmitriy Deviatkin1            © Artem Shelmanov1                       © Daniil Larionov2
                   devyatkin@isa.ru           shelmanov@isa.ru dslarionov@protonmail.com
     1
         Federal Research Center “Computer Science and Control” of Russian Academy of Sciences,
                                               Moscow, Russia
                                 2
                                   People's Friendship University of Russia,
                                               Moscow, Russia
          Abstract. We present a text processing framework for discovering emergency-related events via analysis of
          information sources such as social networks. The framework performs focused crawling of messages, text
          parsing, information extraction, detection of messages related to emergencies, as well as automatic
          discovery of novel events and matching of them across different information sources. For detection of
          emergency-related messages, we use a CNN and word embeddings. For discovering novel events and matching
          them across different sources, we propose a multimodal topic model enriched with spatial information and a
          method based on Jensen–Shannon divergence. The components of the framework are experimentally
          evaluated on Twitter and Facebook data.
          Keywords: event detection, topic modelling, monitoring, named entity recognition, text processing, novel
          topic.

 1       Introduction

 Recent research showed that Twitter, Facebook, and other social networks have valuable applications in
 emergency situations. Since large-scale emergency events give rise to massive publication activity in social
 networks [35], these resources accumulate information about the situation in affected areas, infrastructure
 damage, casualties, and requests and proposals for help. They have already been used for enhancing situation
 awareness of affected people and emergency response teams [3, 21, 15], as well as for online detection and
 monitoring of emergency events like earthquakes [27, 29]. Advanced information retrieval techniques can
 detect emergencies in text streams automatically, so direct appeals to the rescue services through the
 standard channels may not be needed.
     This research continues the previous studies presented in [10, 11] that are devoted to monitoring
 restricted geographical regions via social networks for enhancing situation awareness during emergency
 situations. In this work, we solve the task of automatic identification of emergency events in a stream of
 text messages. We consider an event in a text stream as a group of topically related messages that reflect a
 real-life event in a small time period. Since we are looking for emergency events, it is crucial to detect
 them as soon as possible: long before they become trendy and attract a large number of publications.
 Therefore, one of the peculiarities of this task is the problem of identification of novel topics that
 correspond to emergency events. It is also important to distinguish events (earthquakes, fire breakouts,
 storms, hurricanes, etc.) that happen in different locations at the same time even though they generate
 topically similar text streams (e.g. destruction caused by a single storm that moves across a country should
 be identified as different events).
     The task set in this work has a global spatial restriction. In particular, we are interested primarily
 in the events and messages from the Arctic zone. This restriction brings additional difficulties due to
 sparseness of data and a lack of ready-to-use software, methods, and linguistic resources needed for text
 processing.
     In this work, we evaluate several models for detection of emergency related messages based on various
 types of embeddings and classification techniques including deep learning. We present a multimodal topic
 model for event discovering that leverages spatial information, and describe approaches to assessing event
 novelty and matching events from different information sources. The experimental evaluations on collections
 of messages from Twitter and Facebook show that our methods outperform the baselines.
     The rest of the paper is structured as follows. Section 2 reviews the related work on methods for novel
 topic/event detection in text streams. Section 3 describes the natural language pipeline of our system
 including the subsystem for extraction of emergency related messages. Section 4 presents the developed
 method for novel emergency event discovering and matching across information sources. Section 5 describes
 the experimental evaluation of methods. Section 6 concludes and outlines the future work.

Proceedings of the XX International Conference
“Data Analytics and Management in Data Intensive
Domains” (DAMDID/RCDL’2018), Moscow, Russia,
October 9-12, 2018


2   Related Work

The work related to our current research includes publications considering the tasks of event detection in
microblogs, topic evolution tracking, as well as emerging topic detection. Most of the approaches to these
problems can be divided into two major groups.
     The first group of methods for emerging event detection and tracking primarily relies on topic models
adapted to temporal aspects of the task. They are based on different modifications of PLSA models [13]
(often LDA [6]). One of the fundamental works in this area is [5]. It proposes several dynamic topic models
that align topics across time steps with logistic normal distribution, train with approximation based on
variational Kalman filters and perform inference with the help of wavelet regression. Another fundamental
model named “topics over time” is presented in [30]. The authors propose a method for jointly modelling both
word co-occurrences and localization in continuous time without employing the Markov assumption. Another
topic model that takes into account the temporal dimension is on-line LDA presented in [1]. In this approach,
distributions generated on the previous time steps are used as priors for word generation on the current
step. For each topic, the method builds a transformation matrix that captures the evolution of the topic over
time. The authors consider a topic as emerging if it is significantly different from topics in the same time
period or from all topics seen before. For topic comparison, Kullback-Leibler divergence is used. In [31],
instead of creating a monolithic Bayesian model, the researchers propose to learn a topic model and a
transition matrix to shift distributions over discrete time steps. They formulate the problem of model
learning as minimizing the least square error between the topic distribution predicted using the
transformation and the actual topic distribution of new documents. The proposed approach provides the ability
to predict topic trends in the future. Other notable related work on topic models for emerging topic
detection in microblog data includes Twitter-LDA [12], BBTM (bursty biterm topic model) [34], and
TopicSketch [33].
     The second group of methods is based on detection of emerging features like terms, keywords, or token
segments, and clustering of them. In [7], to define emerging terms the authors use two metrics named
“nutrition” and “energy function” (biology metaphor). Nutrition of a term is calculated as a sum of modified
term frequency in a tweet multiplied by author importance (calculated via PageRank) summed through all tweets
in a time period. The energy function of a term is proportional to the difference of its current nutrition
and its nutrition in the previous time intervals. The authors declare a term as emerging if its energy value
is more than the “critical drop” value, which is proportional to the average energy of all terms in the
current time period. Using co-occurrence of terms, the authors build a graph with edges that correspond to
the strongest relationships between terms. The emerging terms become seeds of strongly connected components
that finally represent emerging topics. The authors of [32] use wavelet analysis for detection of emerging
keywords. They consider frequencies of words as signals and decode these signals with wavelet analysis. Some
trivial words are filtered away by analyzing their corresponding signal autocorrelations. The remaining words
are then clustered to form events with a modularity-based graph partitioning technique. In [8], a real-time
framework for detecting hot emerging topics for organizations in social media context is presented. The
authors discover emerging topics and extract emerging features from both the organization and topic
perspectives. They extract emerging terms by leveraging the chi-square test for foreground and background
distributions of terms. Topics are discovered by an incremental k-means type clustering algorithm. To perform
timely identification of hot emerging topics, the authors proposed two semi-supervised classifiers (based on
co-training and self-learning). The authors engineered several features that incorporate the authority of a
source, importance of keywords, number of retweets, and some other aspects. In [28], the emerging keywords
are identified using a significance measure based on an outlier detection algorithm. More specifically, the
authors used an exponentially weighted average of terms and co-occurring terms. For detection of novel
events, in [20], the researchers propose to use, instead of single unigrams, so-called “event segments”:
key phrases for an event that possibly refer to named entities or semantically meaningful information units.
They cluster event segments into events considering both their frequency distribution and content similarity.
Emerging segments are detected by abnormal frequency distribution of the tweet and user frequencies of the
segments. Importance of an event is also determined by Wikipedia. The authors consider segments that
frequently appear as anchors in Wikipedia more favorable. This approach is intended for finding the most
realistic events and deriving the most newsworthy segments to describe the identified events.
     The method presented in [14] combines the two aforementioned approaches: it uses topic modelling in
conjunction with models for emerging terms detection. Topic models are used to detect topic distributions in
each time interval. Term novelty is estimated by local weighted linear regression. In order to advance from
detection of term novelty to detection of topic novelty, the authors solve an optimization problem. The
solution gives novelty and fading probabilities for a topic. Based on these two probabilities, topic
evolution operations are defined subsequently to identify emerging topics from the large number of latent
ones and track how these topics evolve over time. To compare topics, the authors use Jensen-Shannon distance.
     Another approach to emergency event detection employs a dictionary learning method [17]. The dictionary
contains topics, which consist of atoms (numerical vectors). Vector representation of documents can be
approximated with a linear combination of such atoms. The method consists of two steps: determining novel
documents in a text stream and identifying a cluster structure among the novel documents. In the first step,
the method checks whether a new document can be represented as a sparse linear combination of known atoms
with low error. If it is not the case, the document is considered


novel. Such documents are used to learn a new dictionary of novel topics. In the second step, the learned
dictionary is used to build clusters of similar novel messages. These clusters are considered as emerging
topics.
    Our approach to novel event discovering is based on multimodal topic modeling and takes into account
spatial information. Its key benefits compared to the previous work are the following.
● It allows separating similar emergency events that happened in different locations (for example, storms
     or typhoons).
● It provides an obvious way to match messages from different sources (social networks) taking into
     account location information.
● It can help to reveal location information of an event from a set of scattered messages.

3      Natural Language Processing Pipeline

Our method for event discovering needs complex preprocessing of natural language texts. We perform basic
linguistic analysis, named entity recognition, time recognition, and detection of emergency related texts.
    The final results of the natural language processing pipeline are used for three tasks: focused
crawling, enriching information about events, and creating modalities for topic models.

3.1     Basic Linguistic Analysis

The basic linguistic analysis includes tokenization, sentence splitting, pos-tagging, lemmatization, and
syntax parsing. The pipeline is implemented via IsaNLP (footnote 19), a library that organizes various NLP
components for English and Russian. In this paper, we perform experiments only with English texts; therefore,
the constructed pipeline contains only components for parsing English.
    Tokenization, sentence splitting, pos-tagging, and lemmatization are performed by components based on
the NLTK toolkit [4]. The syntax parsing is performed by SyntaxNet McParseface [2].

3.2     Named Entity Recognition

We perform extraction of the following types of objects: person’s names, organizations, geographical
locations, and ship names. For basic NER extraction, we use the Polyglot framework. This system uses distant
supervision on Wikipedia for learning the underlying model and is able to perform named entity recognition
for 40 languages. However, we note that performance of such an approach is not suitable for location
extraction due to lack of recall. High recall of spatial information is needed to perform filtering of the
text stream and topic modelling. Wikipedia lacks many miscellaneous locations; therefore, there is not enough
data for training a good model. Polyglot also lacks the ability to normalize locations.
     To improve the recall of location extraction and achieve the ability to normalize extracted textual
information into geographic coordinates, in the previous work, we implemented a rule- and dictionary-based
module [10]. We created a gazetteer from Geonames (footnote 20) and supplied it with several filtering rules
based on pos-tags of extracted tokens. Geonames also provides mapping of locations into the geographic
coordinates.
     To extract and normalize temporal expressions, we use a combination of two tools: spaCy (footnote 21; an
NLP framework based on deep learning) and a datetimeparser (footnote 22; a library based on a set of
hand-crafted rules).
     For extraction of ship names, in the previous work [11], we implemented a hybrid approach. On the basis
of a database of ship names, we implemented a gazetteer that has high recall but low precision due to the
fact that many generic words appear to be ship names. To mitigate this problem, we also trained a neural
network based on the C-LSTM architecture [36]. The network filters out erroneous cases generated by the
gazetteer and drastically improves precision and overall F1-score of ship name detection.

3.3     Detection of Emergency Related Messages

For detection of emergency related tweets, in the previous work, we also used a combination of a gazetteer
and a neural network based on the C-LSTM architecture. The gazetteer is based on the CrisisLex lexicon,
proposed in [23]. This gazetteer generates many false positives that are filtered out by the neural network.
To create this solution, in the previous work, we collected a corpus of tweets and trained a neural network
on it. In this work, we improve the module for detection of emergency related messages by incorporating more
labeled data from CrisisLex corpora [24] and by exploring:
● Various embeddings: word-level: fastText [16] (trained on our own corpus / pre-trained on English
     Wikipedia), GloVe [26] (Common Crawl with dimension 300 / Twitter with dimension 200), Word2Vec [22];
     sentence-level: InferSent [9].
● Various types of models: logistic regression (from scikit-learn), random forest (from scikit-learn),
     gradient boosting on decision trees (LightGBM algorithm [18]), fully-connected network (FCN),
     convolutional neural network (CNN), and C-LSTM as before.
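
The hybrid scheme described in Sections 3.2 and 3.3 (a high-recall gazetteer whose candidates are then
filtered by a classifier) can be sketched as follows. This is only a minimal illustration under stated
assumptions, not the authors' implementation: the toy gazetteer, the cue-word rule that stands in for the
C-LSTM filter, and all function names are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index, matched phrase)

# Hypothetical toy gazetteer; the paper builds its gazetteer from a ship-name database.
SHIP_GAZETTEER: Set[str] = {"aurora", "victory", "polar star"}


def gazetteer_candidates(tokens: List[str], gazetteer: Set[str], max_len: int = 3) -> List[Span]:
    """High-recall step: return every n-gram (up to max_len tokens) found in the gazetteer."""
    lowered = [t.lower() for t in tokens]
    found = []
    for start in range(len(lowered)):
        for end in range(start + 1, min(start + max_len, len(lowered)) + 1):
            phrase = " ".join(lowered[start:end])
            if phrase in gazetteer:
                found.append((start, end, phrase))
    return found


def cue_word_filter(tokens: List[str], span: Span) -> bool:
    """Stand-in for the neural filter (assumption): keep a candidate only if a cue word is nearby."""
    cues = {"ship", "vessel", "icebreaker", "tanker", "crew"}
    start, end, _ = span
    window = tokens[max(0, start - 3): end + 3]
    return any(t.lower() in cues for t in window)


def extract_ship_names(tokens: List[str],
                       gazetteer: Set[str] = SHIP_GAZETTEER,
                       accept: Callable[[List[str], Span], bool] = cue_word_filter) -> List[str]:
    """Precision step: drop gazetteer candidates rejected by the filter."""
    return [span[2] for span in gazetteer_candidates(tokens, gazetteer) if accept(tokens, span)]


if __name__ == "__main__":
    print(extract_ship_names("The icebreaker Aurora left the port".split()))
    print(extract_ship_names("A great victory for the team".split()))
```

In the second call the generic word "victory" is matched by the gazetteer but rejected by the filter, which
is the kind of false positive the paper reports the neural network removing.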

19   https://github.com/IINemo/isanlp
20   http://www.geonames.org/
21   https://spacy.io/
22   https://github.com/scrapinghub/dateparser


[Figure 1: flowchart. Boxes on the Twitter side: Twitter; Twitter topic crawling; Natural language processing
(Linguistic processing; Information extraction (locations, objects, etc.); Detect emergency messages);
Multimodal topic modeling; Filter background topics; Find novel events. Boxes on the Facebook side: Generate
queries for Facebook; Topic crawling of Facebook; Facebook; Natural language processing (Linguistic
processing; Information extraction (locations, objects, etc.); Detect emergency messages); Check topic
similarity; New emergency messages.]

                                            Figure 1. Emergency event detection process
    For logistic regression, random forest, gradient boosting algorithms, as well as for FCN, we averaged
word embeddings and used the resulting vector as features. Word-level embeddings in C-LSTM and CNN were
processed in a standard way. Sentence-level embeddings were not used in C-LSTM and CNN since these
architectures work only with sequences. For the remaining algorithms, sentence-level embeddings were used as
common features.
    The fully-connected network is a simple 2-layer perceptron with dropout in the middle. The first layer
activation function is ReLU; the outputs of the last layer are passed through the softmax. The architecture
of the convolutional neural network for sentence classification was proposed in [19]. In this architecture, a
padded sequence of word embeddings is processed by a one-dimensional convolution layer, followed by a max
pooling layer to reduce dimensionality. The resulting vectors are stacked into a single one and are fed into
a fully-connected layer to make a prediction. Activation functions for convolutional and fully-connected
layers are set to ReLU and softmax respectively. The architecture of C-LSTM consists of a 1-d convolution
layer with ReLU activation and max pooling followed by an LSTM recurrent layer. The final predictions are
made by two dense layers with hyperbolic tangent and softmax activations. Neural networks were implemented
with PyTorch [25].

4 Emergency Event Detection Method

The pipeline for emergency event detection is depicted in Figure 1. In the first step, we collect all
messages from Twitter using the topic search API [11] and a crisis-related lexicon. Then, we detect emergency
related messages among crawled tweets using methods described in section 3.3 and filter out all irrelevant
tweets.
    In the second step, we train a multimodal topic model to identify emergency events described by messages
and then determine novel events among them by comparing term distributions of the events from adjacent time
periods.
    In the third step, we use event-related and location-related lexis from the obtained topics to crawl
messages from other sources (Facebook in particular). Then, we apply the emergency detection method again and
filter out all irrelevant posts. The trained topic model is used to check whether the remaining messages are
topically similar to the events extracted from Twitter.

4.1          Identification of Events

In the first step, we discretize the timeline into small time periods (one day in the experiments). In each
time period, a multimodal topic model with additive regularization [37, 38] is trained.
    Let $D$ be a collection of tweets from a time period, let Def be a default modality (regular
event-related lexis) and let Loc be a modality devoted to location of events. The main reason to use such
modalities is to separate similar events that happened in different places in one period of time. We consider
each message $d \in D$ as a set of tokens related to those modalities, $W = W_{def} \cup W_{loc}$. The goal
of the topic modeling is to find a factorization for the matrix of empirical probabilities for documents and
tokens:

    $$\hat{p}(w|d) \approx p(w|d) = \sum_{t \in T} p(w|t)\,p(t|d) = \sum_{t \in T} \varphi_{wt}\theta_{td}, \quad \forall w \in W. \tag{1}$$

    This problem could be solved by maximizing the weighted sum of the following log-likelihoods with
additive regularizers:

    $$L(\Phi, \Theta) = \sum_{\gamma \in \Gamma} \gamma \sum_{d \in D} \sum_{w \in W_\gamma} n_{dw} \ln \sum_{t \in T} \varphi_{wt}\theta_{td}
      + \alpha R_{ss}(\Theta) + \beta R_{ss}(\Phi) + \tau R_{decorr}(\Phi_{loc}) \to \max_{\Phi, \Theta}. \tag{2}$$
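
To make the likelihood term of objective (2) concrete, the sketch below evaluates factorization (1) and the
modality-weighted log-likelihood on a tiny hand-made example. It is an illustration only: the regularizers
are omitted, the matrices are fixed by hand rather than trained (the paper trains them with BigARTM), and the
data layout and function names are assumptions.

```python
import math
from typing import Dict, List

# phi[w][t] = p(w | t); within each modality, every topic column sums to 1.
# theta[t][d] = p(t | d); every document column sums to 1.
Phi = Dict[str, List[float]]
Theta = List[List[float]]


def p_w_given_d(phi: Phi, theta: Theta, w: str, d: int) -> float:
    """Equation (1): p(w|d) = sum over topics t of phi_wt * theta_td."""
    return sum(phi[w][t] * theta[t][d] for t in range(len(theta)))


def weighted_log_likelihood(counts: List[Dict[str, int]], phi: Phi, theta: Theta,
                            modality_of: Dict[str, str], gamma: Dict[str, float]) -> float:
    """Likelihood part of (2): sum of gamma_m * n_dw * ln p(w|d); regularizers are left out."""
    total = 0.0
    for d, doc_counts in enumerate(counts):
        for w, n_dw in doc_counts.items():
            if n_dw > 0:
                total += gamma[modality_of[w]] * n_dw * math.log(p_w_given_d(phi, theta, w, d))
    return total


# Toy data (hypothetical): two topics, two documents, two modalities.
phi = {"storm": [0.5, 0.5], "flood": [0.5, 0.5],          # default modality
       "murmansk": [0.9, 0.1], "tromso": [0.1, 0.9]}      # location modality
theta = [[0.8, 0.2], [0.2, 0.8]]
modality_of = {"storm": "def", "flood": "def", "murmansk": "loc", "tromso": "loc"}
counts = [{"storm": 2, "murmansk": 1}, {"flood": 1, "tromso": 1}]
gamma = {"def": 1.0, "loc": 2.0}

if __name__ == "__main__":
    print(p_w_given_d(phi, theta, "murmansk", 0))
    print(weighted_log_likelihood(counts, phi, theta, modality_of, gamma))
```

In this toy setup the default-modality tokens are identical across the two topics, so only the location
tokens distinguish them; raising the location weight makes the location terms count more in the objective.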


Table 3. Results of the models for emergency-related message detection (F1-score), %
                                                                          Embedding features
 Models                     FstTrain                 FstWiki           GloveCC          GloveTwt            W2V            InferSent

 LogReg                     87.4±8.4                  82.5±9.2          88.6±5.3        85.1±6.9          88.9±6.7          89.4±4.9
 Rnd For.                   86.9±9.5                 82.3±11.1          87.4±7.4        83.9±10.5         87.4±8.9          89.4±4.9
 GBDT                       91.7±0.1                  89.8±0.1          93.0±0.1        89.8±0.2          92.0±0.2            N/A
 FCN                        90.9±0.3                  89.8±0.1          92.2±0.3        88.0±0.2          91.2±0.3          90.8±0.2
 CNN                        94.3±0.3                  93.4±0.3          93.8±0.2        92.7±0.2          92.9±0.2            N/A
 CLSTM                      92.1±0.2                  92.2±0.3          92.2±0.6        91.5±0.5          92.3±0.5            N/A
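
For the LogReg, Rnd For., GBDT, and FCN rows of Table 3, Section 3.3 states that word embeddings are averaged
and the result is used as the feature vector. A minimal sketch of that averaging step is given below; the
three-dimensional toy vectors, the skipping of out-of-vocabulary tokens, and the zero-vector fallback are
assumptions made for illustration, not details taken from the paper.

```python
from typing import Dict, List


def average_embedding(tokens: List[str], embeddings: Dict[str, List[float]], dim: int) -> List[float]:
    """Mean of the embeddings of known tokens; a zero vector if no token is known (assumption)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical toy embeddings; real ones would come from fastText, GloVe, or Word2Vec.
toy_embeddings = {"flood": [1.0, 0.0, 2.0], "warning": [3.0, 2.0, 0.0]}

if __name__ == "__main__":
    print(average_embedding(["flood", "warning", "issued"], toy_embeddings, dim=3))
```

The resulting fixed-length vector can be passed to any classifier that expects a flat feature vector, which
is why the sequence models (CNN, C-LSTM) do not need this step.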


    Here $\gamma \in \Gamma = \{\gamma_{def}, \gamma_{loc}\}$ are weights       earlier similar topics in a predefined time window.
of the modalities, Φ is a matrix of token probabilities for topics,
and Θ is a matrix of topic probabilities for documents. As                      4.3   Events Matching
in [37], we apply smooth-sparse regularizers to achieve                         In the third step, we match messages related to the same
smooth term distributions in topics and sparse topic                            event from different sources, which can be various types
distributions in messages:                                                      of social networks or mass media sites. In experiments,
                                                                                we enriched messages from Twitter related to novel
  $R_{ss}(\Phi) = \sum_{t \in T} KL(\beta_t \,\|\, \varphi_{wt}),$   (3)        emergency events with Facebook public posts. For each
                                                                                novel event, we construct a search query as a
                                                                                combination of default and location tokens with the
                                                                                highest weights. To crawl Facebook, we use Ghost.py 23
  $R_{ss}(\Theta) = -\sum_{d \in D} KL(\alpha_d \,\|\, \theta_{td}),$  (4)      library.
                                                                                    We filter obtained posts (leaving only emergency
where $\alpha_d$ and $\beta_t$ are sampled from some predefined                 related messages) as described in section 3.3 and extract
distributions.                                                                  named entities and locations from them. We infer topic-
    We apply decorrelation regularizer only for location                        probabilities matrix Θ � for remaining posts using the
modality to be able to detect similar events happened in                        pretrained model for the event. Then, we filter all
different places at the same time:                                              messages, which are not topically similar to the event.
                                                                                Due to the use of multimodal models, information about
     𝑅𝑅𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑 (𝛷𝛷𝑙𝑙𝑙𝑙𝑙𝑙 ) = − �      � 𝜑𝜑𝑤𝑤𝑤𝑤 𝜑𝜑𝑤𝑤𝑤𝑤 .        (5)           locations is also taken into account when assessing the
                              𝑡𝑡,𝑠𝑠∈𝑇𝑇 𝑤𝑤∈𝑊𝑊𝑙𝑙𝑙𝑙𝑙𝑙                              similarity of posts.
    We use BigARTM library [39] to train multimodal
                                                                                5 Experiments
models. The result is Φ and Θ matrices for each time
period. After that, “background” topics with high                               5.1   Detection of Emergency Related Messages
entropy of token distributions can be filtered.
                                                                                Dataset and Pre-processing
4.2      Detection of Novel Events
                                                                                For evaluation of method for detection of emergency
In the second step, we determine whether the extracted                          related messages, we use the CrisisLexT6 dataset. The
events were discussed before. We aggregate several                              dataset consists of 60,000 tweets related to 6 major crisis
adjacent periods of time to “time windows”. Consider we                         situations. Emergency related tweets are labeled as “on-
have topics s and t in the same time window. Denote                             topic” and others are labeled as “off-topic”. The pre-
vectors of token distributions for these topics as Φ𝑡𝑡                          processing procedure included elimination of the special
and Φ𝑠𝑠 . As in [14], we use Jensen–Shannon divergence                          characters, as well as conversion of hashtags, emojis, and
between token probabilities for the topics to estimate                          URLs into single tokens.
topic similarity:
                                                                                Hyperparameters
                                 1
           𝐽𝐽𝐽𝐽𝐽𝐽(𝛷𝛷𝑡𝑡 ||𝛷𝛷𝑠𝑠 ) = �𝐾𝐾𝐾𝐾(𝛷𝛷𝑡𝑡 ||𝑀𝑀)                              Logistic regression. Regularization: L2 penalty.
                                 2                                              Tolerance: 0.0001. Inverse regularization strength: 1.0.
                                 + 𝐾𝐾𝐾𝐾(𝛷𝛷𝑠𝑠 ||𝑀𝑀)�,               (6)             Random Forest. Number of estimators: 1,000. No
                                                                                limits to maximum number of features and tree depth.
                        1
                    𝑀𝑀 = (𝛷𝛷𝑡𝑡 + 𝛷𝛷𝑠𝑠 ).                                        Split quality measure: Gini impurity. Min number of
                        2                                                       samples per split: 2. Min number of samples per leaf: 1.
      A topic is denoted as a “new event” if there is no                           Gradient boosting. Maximum tree depth: 20. Number

23
     https://github.com/jeanphix/Ghost.py


of leaves: 11. Learning rate: 0.05. Feature fraction: 0.9. Bagging fraction: 0.8. Min frequency: 5.
Number of estimators: 4,000 with early stopping for 200.
    FCN. Size of hidden layer: 256. Dropout: 0.5. Number of epochs: 10. Loss: cross entropy.
Optimization algorithm: Adam. Learning rate: 0.0001. Weight decay: 0. Batch size: 256.
    CNN. Kernel size: [3, 4, 5]. Number of filters: 512. Dropout: 0.5. Optimization algorithm:
Adam. Learning rate: 0.0001. Loss: binary cross entropy. Batch size: 128. Vocabulary size: 10,001.
Number of epochs: 10 with early stopping for 3 epochs.

Results and Discussion

We use 5-fold cross-validation for evaluation. Results are presented in Table 1. We discovered
several insights into problems with processing and analyzing crisis-specific and Twitter-specific
lexicon:
●    Sentence-level embeddings are better than averaging word vectors. Averaging the embeddings of
     all words in a tweet blurs the real meaning of the text. The InferSent embedding model, which
     is constructed using NLI data and BiLSTM encoders, treats a sentence as a single entity and
     performs a more general projection process. However, the higher dimensionality (required to
     make accurate projections) makes it harder to use several classification algorithms.
●    GloVe embeddings pretrained on a Common Crawl corpus show better results than Twitter-specific
     embeddings. Sentence-level embeddings, pretrained on non-specific natural language inference
     data, also show superior results. It seems reasonable that the crisis-related lexicon differs
     from the common Twitter lexicon and tends to be closer to the common lexicon. However, we
     should note that there is a lack of publicly available Twitter data for training. The GloVe
     Twitter corpus contains only 27 billion words, which is much less than the Common Crawl
     corpus size of 840 billion words.
●    All neural network models have a lower standard deviation of F1-score compared to other
     machine learning algorithms (except GBDT). Therefore, the quality of neural networks could be
     much more stable on unseen data and less sensitive to the context.
●    Our best classifier (CNN for text classification + fastText, trained on our dataset)
     outperforms the models presented in the related work [40, 41, 42].

5.2 Novel Emergency Event Extraction

Dataset and Pre-processing

We crawled 60k Twitter messages from April 1, 2018 to April 12, 2018 using the focused crawler
presented in [11]. With the help of the CNN, we filtered out messages that are not related to
emergency events, which reduced the number of tweets in the dataset to 5,200. The remaining tweets
were analyzed with the natural language processing pipeline and with the event discovery method.
After that, we also crawled Facebook posts for each extracted event. Using the developed method,
we filtered out posts that were considered irrelevant to events extracted from Twitter. After
filtering, 1k Facebook posts remained.

Hyperparameters

In our experiments, we applied grid search to tune the weights of the regularizers for topic
models. The criterion for the search was a weighted sum of the model perplexity, the sparsity of
the model’s matrices, and the model’s pointwise mutual information.

Results and Discussion

Since the experiments were conducted on open data, we estimated only the precision of the models.
The results are presented in Table 2. The experiment shows that the proposed approach outperforms
the baseline LDA models. This confirms the importance of using information about the locations in
the framework. One can note the relatively low precision for the events matching. We believe this
is due to the substantial time lag between the message crawling and the event matching
experiments. Thus, true event-related posts may be treated by Facebook’s search as less relevant
than others.

Table 2. Results of the novel emergency event extraction method (precision, %)

    Step               LDA (baseline)    Multimodal model
    All events              63.3               93.3
    Novel events            71.4               80.0
    Event matching          60.0               67.0

6 Conclusion

We considered several problems related to monitoring of social networks: detection of messages
related to emergencies, extraction of novel events, and matching events reflected in different
text sources. For detection of emergency-related messages, we use CNN and word embeddings. For
extraction of novel events and matching them across different sources, we propose a multimodal
topic model enriched with spatial information and a method based on Jensen–Shannon divergence.
    We investigated the performance of different algorithms and embeddings for emergency-related
message detection on the CrisisLexT6 dataset and found that the best solution is given by CNN with
fastText embeddings. We also compared the proposed multimodal topic model and the LDA baseline.
The experimental results are promising and show that the proposed framework could be useful for
monitoring emergency events via messages in social media.
    In future work, we are going to address the problem of locating emergency events and create
visualization tools for presenting them on a geographic map.
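
The topic comparison of Section 4.2 can be illustrated with a minimal sketch in plain Python. It
computes the Jensen–Shannon divergence of Equation (6) for two token distributions and then applies
a hypothetical novelty rule: a topic counts as new when no earlier topic in the time window is
closer than a threshold. The function names, the toy distributions, and the threshold of 0.3 are
assumptions made for illustration only; they are not taken from the framework, whose topic models
are trained with BigARTM.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(phi_t, phi_s):
    """Jensen-Shannon divergence of Equation (6), with M = (phi_t + phi_s) / 2."""
    m = [(a + b) / 2 for a, b in zip(phi_t, phi_s)]
    return 0.5 * (kl_divergence(phi_t, m) + kl_divergence(phi_s, m))


def is_novel(topic, earlier_topics, threshold=0.3):
    """Assumed rule: the topic is new if every earlier topic is farther than `threshold`."""
    return all(js_divergence(topic, old) > threshold for old in earlier_topics)


# Toy token distributions over a four-token vocabulary (illustrative values).
fire = [0.7, 0.2, 0.1, 0.0]
fire_again = [0.6, 0.3, 0.1, 0.0]
flood = [0.0, 0.1, 0.2, 0.7]

print(is_novel(fire_again, [fire]))          # close to an earlier topic
print(is_novel(flood, [fire, fire_again]))   # far from both earlier topics
```

Because M is the average of the two distributions, it is positive wherever either of them is, so
the logarithms are always defined; with natural logarithms the divergence is bounded by ln 2.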


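Section 4.1 notes that “background” topics with high entropy of token distributions can be
filtered. A rough sketch of such a filter is given below. The text does not say how the cutoff is
chosen, so the normalized-entropy cutoff of 0.9, the dictionary layout of Φ, and all names here
are illustrative assumptions rather than the procedure used in the experiments.

```python
import math


def normalized_entropy(dist):
    """Shannon entropy of a token distribution, divided by log(vocabulary size)."""
    if len(dist) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in dist if p > 0)
    return h / math.log(len(dist))


def drop_background_topics(phi, cutoff=0.9):
    """Keep topics whose normalized entropy does not exceed the assumed cutoff.

    `phi` maps a topic name to its token distribution (one column of the matrix Φ).
    """
    return {name: dist for name, dist in phi.items()
            if normalized_entropy(dist) <= cutoff}


phi = {
    "wildfire":   [0.85, 0.05, 0.05, 0.05],  # peaked: a few tokens dominate
    "background": [0.25, 0.25, 0.25, 0.25],  # flat: every token equally likely
}
kept = drop_background_topics(phi)
print(sorted(kept))
```

Dividing by the logarithm of the vocabulary size keeps the score in [0, 1], so one cutoff can be
reused across time periods with different vocabularies.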
Acknowledgments. The project is supported by the Russian Foundation for Basic Research, project
numbers: 15-29-06082, 15-29-06045 “ofi_m”.

References

 [1]   Loulwah AlSumait, Daniel Barbará, and Carlotta Domeniconi. On-line LDA: Adaptive topic models
       for mining text streams with applications to topic detection and tracking. In Data Mining,
       2008. ICDM’08. Eighth IEEE International Conference on, pages 3–12, 2008.
 [2]   Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman
       Ganchev, Slav Petrov, and Michael Collins. Globally normalized transition-based neural
       networks. In Proceedings of the 54th Annual Meeting of the Association for Computational
       Linguistics, pages 2442–2452, 2016.
 [3]   Zahra Ashktorab, Christopher Brown, Manojit Nandi, and Aron Culotta. Tweedr: Mining Twitter
       to inform disaster response. Proceedings of ISCRAM, pages 354–358, 2014.
 [4]   Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python:
       analyzing text with the natural language toolkit, 2009.
 [5]   David M Blei and John D Lafferty. Dynamic topic models. In Proceedings of the 23rd
       international conference on Machine learning, pages 113–120, 2006.
 [6]   David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of
       machine Learning research, pages 993–1022, 2003.
 [7]   Mario Cataldi, Luigi Di Caro, and Claudio Schifanella. Emerging topic detection on Twitter
       based on temporal and social terms evaluation. In Proceedings of the tenth international
       workshop on multimedia data mining, 2010.
 [8]   Yan Chen, Hadi Amiri, Zhoujun Li, and Tat-Seng Chua. Emerging topic detection for
       organizations from microblogs. In Proceedings of the 36th international ACM SIGIR
       conference on Research and development in information retrieval, pages 43–52, 2013.
 [9]   Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised
       learning of universal sentence representations from natural language inference data. In
       Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,
       pages 670–680, 2017.
[10]   D. Deviatkin and A. Shelmanov. Towards text processing system for emergency event detection
       in the Arctic zone. In Proceedings of Data Analytics and Management in Data Intensive
       Domains, pages 225–232, 2016.
[11]   D. Devyatkin and A. Shelmanov. Text processing framework for emergency event detection in
       the Arctic zone. Communications in Computer and Information Science, pages 74–88, 2017.
[12]   Qiming Diao, Jing Jiang, Feida Zhu, and Ee-Peng Lim. Finding bursty topics from microblogs.
       In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics,
       pages 536–544, 2012.
[13]   Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual
       international ACM SIGIR Conference on Research and Development in Information Retrieval,
       pages 50–57, 1999.
[14]   Jiajia Huang, Min Peng, Hua Wang, Jinli Cao, Wang Gao, and Xiuzhen Zhang. A probabilistic
       method for emerging topic tracking in microblog stream. World Wide Web, pages 325–350, 2017.
[15]   Muhammad Imran, Carlos Castillo, Ji Lucas, Patrick Meier, and Sarah Vieweg. AIDR: Artificial
       intelligence for disaster response. In Proceedings of the companion publication of the 23rd
       International Conference on World Wide Web Companion, pages 159–162, 2014.
[16]   Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for
       efficient text classification. In Proceedings of the 15th Conference of the European Chapter
       of the Association for Computational Linguistics, pages 427–431, 2017.
[17]   Shiva Prasad Kasiviswanathan, Prem Melville, Arindam Banerjee, and Vikas Sindhwani. Emerging
       topic detection using dictionary learning. In Proceedings of the 20th ACM international
       conference on Information and knowledge management, pages 745–754, 2011.
[18]   Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan
       Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural
       Information Processing Systems, pages 3149–3157, 2017.
[19]   Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the
       2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages
       1746–1751, 2014.
[20]   Chenliang Li, Aixin Sun, and Anwitaman Datta. Twevent: segment-based event detection from
       tweets. In Proceedings of the 21st ACM international conference on Information and
       knowledge management, pages 155–164, 2012.
[21]   Alan M. MacEachren, Anuj Jaiswal, Anthony C. Robinson, Scott Pezanowski, Alexander Savelyev,
       Prasenjit Mitra, Xiao Zhang, and Justine Blanford. SensePlace2: GeoTwitter analytics support
       for situational awareness. In Proceedings of Visual Analytics Science and Technology (VAST)
       on IEEE Conference, pages 181–190, 2011.


[22]   Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed
       representations of words and phrases and their compositionality. In Advances in neural
       information processing systems, pages 3111–3119, 2013.
[23]   Alexandra Olteanu, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. CrisisLex: A lexicon
       for collecting and filtering microblogged communications in crises. In Proceedings of
       ICWSM, 2014.
[24]   Alexandra Olteanu, Sarah Vieweg, and Carlos Castillo. What to expect when the unexpected
       happens: Social media communications across crises. In Proceedings of the 18th ACM
       Conference on Computer Supported Cooperative Work & Social Computing, pages 994–1009, 2015.
[25]   Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
       Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in
       PyTorch. In NIPS-W, 2017.
[26]   Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word
       representation. In Proceedings of the 2014 conference on empirical methods in natural
       language processing (EMNLP), pages 1532–1543, 2014.
[27]   Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes Twitter users:
       real-time event detection by social sensors. In Proceedings of the 19th international
       conference on World Wide Web, pages 851–860, 2010.
[28]   Erich Schubert, Michael Weiler, and Hans-Peter Kriegel. Signitrend: scalable detection of
       emerging topics in textual streams by hashed significance thresholds. In Proceedings of the
       20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages
       871–880, 2014.
[29]   Sayan Unankard, Xue Li, and Mohamed A Sharaf. Emerging event detection in social networks
       with location sensitivity. World Wide Web, pages 1393–1417, 2015.
[30]   Xuerui Wang and Andrew McCallum. Topics over time: a non-markov continuous-time model of
       topical trends. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge
       discovery and data mining, pages 424–433, 2006.
[31]   Yu Wang, Eugene Agichtein, and Michele Benzi. Tm-LDA: efficient online modeling of latent
       topic transitions in social media. In Proceedings of the 18th ACM SIGKDD international
       conference on Knowledge discovery and data mining, pages 123–131, 2012.
[32]   Jianshu Weng and Bu-Sung Lee. Event detection in Twitter. ICWSM, pages 401–408, 2011.
[33]   Wei Xie, Feida Zhu, Jing Jiang, Ee-Peng Lim, and Ke Wang. Topicsketch: Real-time bursty
       topic detection from Twitter. IEEE Transactions on Knowledge and Data Engineering, pages
       2216–2229, 2016.
[34]   Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng. A probabilistic model for
       bursty topic discovery in microblogs. In AAAI, pages 353–359, 2015.
[35]   Jie Yin, Sarvnaz Karimi, Bella Robinson, and Mark Cameron. ESA: emergency situation
       awareness via microbloggers. In Proceedings of the 21st ACM International Conference on
       Information and Knowledge Management, pages 2701–2703, 2012.
[36]   Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis Lau. A C-LSTM neural network for text
       classification. arXiv preprint arXiv:1511.08630, 2015.
[37]   Anastasia Ianina, Lev Golitsyn, and Konstantin Vorontsov. Multi-objective topic modeling for
       exploratory search in tech news. Conference on Artificial Intelligence and Natural
       Language, pages 181–193, 2017.
[38]   Konstantin Vorontsov and Anna Potapenko. Additive regularization of topic models. Machine
       Learning, pages 303–323, 2015.
[39]   Konstantin Vorontsov et al. BigARTM: Open source library for regularized multimodal topic
       modeling of large collections. International Conference on Analysis of Images, Social
       Networks and Texts, pages 370–381, 2015.
[40]   S. Roy Chowdhury, H. Purohit, and M. Imran. D-sieve: a novel data processing engine for
       efficient handling of crises-related social messages. In Proceedings of the 24th
       International Conference on World Wide Web, pages 1227–1232, 2015.
[41]   S. Zhang and S. Vucetic. Semi-supervised discovery of informative tweets during the emerging
       disasters. arXiv preprint arXiv:1610.03750, 2016.
[42]   H. Li, D. Caragea, C. Caragea, and N. Herndon. Disaster response aided by tweet
       classification with a domain adaptation approach. Journal of Contingencies and Crisis
       Management, pages 16–27, 2018.



</pre>