=Paper=
{{Paper
|id=Vol-2277/paper36
|storemode=property
|title=
Discovering Novel Emergency Events in Text Streams
|pdfUrl=https://ceur-ws.org/Vol-2277/paper36.pdf
|volume=Vol-2277
|authors=Artem Shelmanov,Dmitriy Deviatkin,Daniil Larionov
|dblpUrl=https://dblp.org/rec/conf/rcdl/ShelmanovDL18
}}
==
Discovering Novel Emergency Events in Text Streams
==
© Dmitriy Deviatkin<sup>1</sup>, © Artem Shelmanov<sup>1</sup>, © Daniil Larionov<sup>2</sup>

devyatkin@isa.ru, shelmanov@isa.ru, dslarionov@protonmail.com

<sup>1</sup> Federal Research Center “Computer Science and Control” of Russian Academy of Sciences, Moscow, Russia

<sup>2</sup> People's Friendship University of Russia, Moscow, Russia

Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL’2018), Moscow, Russia, October 9-12, 2018

Abstract. We present a text processing framework for discovering emergency-related events via analysis of information sources such as social networks. The framework performs focused crawling of messages, text parsing, information extraction, detection of messages related to emergencies, as well as automatic discovery of novel events and matching of them across different information sources. For detection of emergency-related messages, we use a CNN and word embeddings. For discovering novel events and matching them across different sources, we propose a multimodal topic model enriched with spatial information and a method based on Jensen–Shannon divergence. The components of the framework are experimentally evaluated on Twitter and Facebook data.

Keywords: event detection, topic modelling, monitoring, named entity recognition, text processing, novel topic.

===1 Introduction===

Recent research has shown that Twitter, Facebook, and other social networks have valuable applications in emergency situations. Since large-scale emergency events give rise to massive publication activity in social networks [35], these resources accumulate information about the situation in affected areas, infrastructure damage, casualties, and requests and offers of help. They have already been used for enhancing the situation awareness of affected people and emergency response teams [3, 21, 15], as well as for online detection and monitoring of emergency events such as earthquakes [27, 29]. Advanced information retrieval techniques can detect emergencies in text streams automatically, so direct appeals to the rescue services through the standard channels may not be needed.

This research continues the previous studies presented in [10, 11], which are devoted to monitoring restricted geographical regions via social networks for enhancing situation awareness during emergency situations. In this work, we solve the task of automatic identification of emergency events in a stream of text messages. We consider an event in a text stream to be a group of topically related messages that reflect a real-life event in a small time period. Since we are looking for emergency events, it is crucial to detect them as soon as possible: long before they become trending and accumulate a large number of publications. Therefore, one of the peculiarities of this task is the problem of identifying novel topics that correspond to emergency events. It is also important to distinguish events (earthquakes, fire outbreaks, storms, hurricanes, etc.) that happen in different locations at the same time even though they generate topically similar text streams (e.g. destruction caused by a single storm that moves across a country should be identified as different events).

The task set in this work has a global spatial restriction. In particular, we are interested primarily in the events and messages from the Arctic zone. This restriction brings additional difficulties due to the sparseness of data and the lack of ready-to-use software, methods, and linguistic resources needed for text processing.

In this work, we evaluate several models for detection of emergency-related messages based on various types of embeddings and classification techniques, including deep learning. We present a multimodal topic model for event discovery that leverages spatial information, and describe approaches to assessing event novelty and matching events from different information sources. The experimental evaluations on collections of messages from Twitter and Facebook show that our methods outperform the baselines.

The rest of the paper is structured as follows. Section 2 reviews the related work on methods for novel topic/event detection in text streams. Section 3 describes the natural language pipeline of our system, including the subsystem for extraction of emergency-related messages. Section 4 presents the developed method for discovering novel emergency events and matching them across information sources. Section 5 describes the experimental evaluation of the methods. Section 6 concludes and outlines the future work.

===2 Related Work===

The work related to our current research includes publications considering the tasks of event detection in microblogs, topic evolution tracking, and emerging topic detection. Most of the approaches to these problems can be divided into two major groups.

The first group of methods for emerging event detection and tracking primarily relies on topic models adapted to the temporal aspects of the task. They are based on different modifications of PLSA models [13] (often LDA [6]). One of the fundamental works in this area is [5]. It proposes several dynamic topic models that align topics across time steps with a logistic normal distribution, are trained with an approximation based on variational Kalman filters, and perform inference with the help of wavelet regression. Another fundamental model, named “topics over time”, is presented in [30]. The authors propose a method for jointly modelling both word co-occurrences and localization in continuous time without employing the Markov assumption. Another topic model that takes the temporal dimension into account is on-line LDA, presented in [1]. In this approach, distributions generated on the previous time steps are used as priors for word generation on the current step. For each topic, the method builds a transformation matrix that captures the evolution of the topic over time. The authors consider a topic to be emerging if it is significantly different from topics in the same time period or from all topics seen before. For topic comparison, Kullback-Leibler divergence is used. In [31], instead of creating a monolithic Bayesian model, the researchers propose to learn a topic model and a transition matrix that shifts distributions over discrete time steps. They formulate the problem of model learning as minimizing the least-squares error between the topic distribution predicted using the transformation and the actual topic distribution of new documents. The proposed approach provides the ability to predict topic trends in the future. Other notable related work on topic models for emerging topic detection in microblog data includes Twitter-LDA [12], BBTM (bursty biterm topic model) [34], and TopicSketch [33].

The second group of methods is based on detection of emerging features such as terms, keywords, or token segments, and on clustering of them. In [7], to define emerging terms, the authors use two metrics named “nutrition” and “energy function” (a biology metaphor). The nutrition of a term is calculated as the sum, over all tweets in a time period, of the modified term frequency in a tweet multiplied by the author's importance (calculated via PageRank). The energy function of a term is proportional to the difference between its current nutrition and its nutrition in the previous time intervals. The authors declare a term to be emerging if its energy value exceeds the “critical drop” value, which is proportional to the average energy of all terms in the current time period. Using co-occurrence of terms, the authors build a graph with edges that correspond to the strongest relationships between terms. The emerging terms become seeds of strongly connected components that finally represent emerging topics. The authors of [32] use wavelet analysis for detection of emerging keywords. They consider frequencies of words as signals and decode these signals with wavelet analysis. Some trivial words are filtered out by analyzing their corresponding signal autocorrelations. The remaining words are then clustered to form events with a modularity-based graph partitioning technique. In [8], a real-time framework for detecting hot emerging topics for organizations in the social media context is presented. The authors discover emerging topics and extract emerging features from both the organization and topic perspectives. They extract emerging terms by applying a chi-square test to the foreground and background distributions of terms. Topics are discovered by an incremental k-means type clustering algorithm. To perform timely identification of hot emerging topics, the authors propose two semi-supervised classifiers (based on co-training and self-learning). They engineer several features that incorporate the authority of a source, the importance of keywords, the number of retweets, and some other aspects. In [28], the emerging keywords are identified using a significance measure based on an outlier detection algorithm. More specifically, the authors use an exponentially weighted average of terms and co-occurring terms. For detection of novel events, the researchers in [20] propose to use, instead of single unigrams, so-called “event segments”, i.e. key phrases for an event that possibly refer to named entities or semantically meaningful information units. They cluster event segments into events considering both their frequency distribution and content similarity. Emerging segments are detected by abnormal distributions of the tweet and user frequencies of the segments. The importance of an event is also determined using Wikipedia: the authors favor segments that frequently appear as anchors in Wikipedia. This approach is intended to find the most realistic events and to derive the most newsworthy segments to describe the identified events.

The method presented in [14] combines the two aforementioned approaches: it uses topic modelling in conjunction with models for detection of emerging terms. Topic models are used to detect topic distributions in each time interval. Term novelty is estimated by locally weighted linear regression. In order to advance from detection of term novelty to detection of topic novelty, the authors solve an optimization problem. The solution gives novelty and fading probabilities for a topic. Based on these two probabilities, topic evolution operations are subsequently defined to identify emerging topics among the large number of latent ones and to track how these topics evolve over time. To compare topics, the authors use Jensen-Shannon distance.

Another approach to emergency event detection employs a dictionary learning method [17]. The dictionary contains topics, which consist of atoms (numerical vectors). The vector representation of a document can be approximated with a linear combination of such atoms. The method consists of two steps: determining novel documents in a text stream and identifying a cluster structure among the novel documents. In the first step, the method checks whether a new document can be represented as a sparse linear combination of known atoms with low error. If this is not the case, the document is considered novel. Such documents are used to learn a new dictionary of novel topics. In the second step, the learned dictionary is used to build clusters of similar novel messages. These clusters are considered to be emerging topics.

Our approach to novel event discovery is based on multimodal topic modeling and takes spatial information into account. Its key benefits compared to the previous work are the following.
* It makes it possible to separate similar emergency events that happened in different locations (for example, storms or typhoons).
* It provides an obvious way to match messages from different sources (social networks) taking location information into account.
* It can help to reveal the location of an event from a set of scattered messages.

===3 Natural Language Processing Pipeline===

Our method for event discovery needs complex preprocessing of natural language texts. We perform basic linguistic analysis, named entity recognition, time recognition, and detection of emergency-related texts. The final results of the natural language processing pipeline are used for three tasks: focused crawling, enriching information about events, and creating modalities for topic models.

====3.1 Basic Linguistic Analysis====

The basic linguistic analysis includes tokenization, sentence splitting, POS tagging, lemmatization, and syntax parsing. The pipeline is implemented via IsaNLP (https://github.com/IINemo/isanlp), a library that organizes various NLP components for English and Russian. In this paper, we perform experiments only with English texts; therefore, the constructed pipeline contains only components for parsing English.

Tokenization, sentence splitting, POS tagging, and lemmatization are performed by components based on the NLTK toolkit [4]. The syntax parsing is performed by SyntaxNet McParseface [2].

====3.2 Named Entity Recognition====

We extract the following types of objects: persons' names, organizations, geographical locations, and ship names. For basic NER extraction, we use the Polyglot framework. This system uses distant supervision on Wikipedia for learning the underlying model and is able to perform named entity recognition for 40 languages. However, we note that the performance of such an approach is not suitable for location extraction due to its lack of recall. High recall of spatial information is needed to perform filtering of the text stream and topic modelling. Wikipedia lacks many miscellaneous locations; therefore, there is not enough data for training a good model. Polyglot also lacks the ability to normalize locations.

To improve the recall of location extraction and to be able to normalize extracted textual information into geographic coordinates, in the previous work we implemented a rule- and dictionary-based module [10]. We created a gazetteer from Geonames (http://www.geonames.org/) and supplied it with several filtering rules based on the POS tags of extracted tokens. Geonames also provides a mapping of locations to geographic coordinates.

To extract and normalize temporal expressions, we use a combination of two tools: spaCy (https://spacy.io/), an NLP framework based on deep learning, and dateparser (https://github.com/scrapinghub/dateparser), a library based on a set of hand-crafted rules.

For extraction of ship names, in the previous work [11] we implemented a hybrid approach. On the basis of a database of ship names, we implemented a gazetteer that has high recall but low precision because many generic words also occur as ship names. To mitigate this problem, we also trained a neural network based on the C-LSTM architecture [36]. The network filters out erroneous cases generated by the gazetteer and drastically improves the precision and overall F1-score of ship name detection.

====3.3 Detection of Emergency Related Messages====

For detection of emergency-related tweets, in the previous work we also used a combination of a gazetteer and a neural network based on the C-LSTM architecture. The gazetteer is based on the CrisisLex lexicon proposed in [23]. This gazetteer generates many false positives, which are filtered out by the neural network. To create this solution, in the previous work we collected a corpus of tweets and trained a neural network on it. In this work, we improve the module for detection of emergency-related messages by incorporating more labeled data from the CrisisLex corpora [24] and by exploring:
* Various embeddings. Word-level: fastText [16] (trained on our own corpus / pre-trained on English Wikipedia), GloVe [26] (Common Crawl with dimension 300 / Twitter with dimension 200), Word2Vec [22]. Sentence-level: InferSent [9].
* Various types of models: logistic regression (from scikit-learn), random forest (from scikit-learn), gradient boosting on decision trees (the LightGBM algorithm [18]), a fully-connected network (FCN), a convolutional neural network (CNN), and C-LSTM as before.

For logistic regression, random forest, and gradient boosting, as well as for the FCN, we averaged the word embeddings and used the resulting vector as features. Word-level embeddings in C-LSTM and CNN were processed in the standard way. Sentence-level embeddings were not used in C-LSTM and CNN, since these architectures work only with sequences. For the rest of the algorithms, sentence-level embeddings were used as ordinary features.

The fully-connected network is a simple 2-layer perceptron with dropout in the middle. The activation function of the first layer is ReLU; the outputs of the last layer are passed through the softmax. The architecture of the convolutional neural network for sentence classification was proposed in [19]. In this architecture, a padded sequence of word embeddings is processed by a one-dimensional convolution layer, followed by a max pooling layer to reduce dimensionality. The resulting vectors are stacked into a single one and fed into a fully-connected layer to make a prediction. The activation functions for the convolutional and fully-connected layers are ReLU and softmax respectively. The architecture of C-LSTM consists of a 1-d convolution layer with ReLU activation and max pooling, followed by an LSTM recurrent layer. The final predictions are made by two dense layers with hyperbolic tangent and softmax activations. The neural networks were implemented with PyTorch [25].

===4 Emergency Event Detection Method===

The pipeline for emergency event detection is depicted in Figure 1. In the first step, we collect all messages from Twitter using the topic search API [11] and a crisis-related lexicon. Then, we detect emergency-related messages among the crawled tweets using the methods described in Section 3.3 and filter out all irrelevant tweets.

In the second step, we train a multimodal topic model to identify the emergency events described by the messages, and then determine novel events among them by comparing term distributions of the events from adjacent time periods.

In the third step, we use event-related and location-related lexis from the obtained topics to crawl messages from other sources (Facebook in particular). Then, we apply the emergency detection method again and filter out all irrelevant posts. The trained topic model is used to check whether the remaining messages are topically similar to the events extracted from Twitter.

Figure 1. Emergency event detection process

====4.1 Identification of Events====

In the first step, we discretize the timeline into small time periods (one day in the experiments). In each time period, a multimodal topic model with additive regularization [37, 38] is trained.

Let <math>D</math> be a collection of tweets from a time period, let Def be the default modality (regular event-related lexis), and let Loc be a modality devoted to the location of events. The main reason to use such modalities is to separate similar events that happened in different places in one period of time. We consider each message <math>d \in D</math> as a set of tokens related to those modalities, <math>W = W_{def} \cup W_{loc}</math>. The goal of the topic modeling is to find a factorization of the matrix of empirical probabilities for documents and tokens:

<math>\hat{p}(w|d) \approx p(w|d) = \sum_{t \in T} p(w|t)\,p(t|d) = \sum_{t \in T} \varphi_{wt}\theta_{td}, \quad \forall w \in W.</math> (1)

This problem can be solved by maximizing the weighted sum of the following log-likelihoods with additive regularizers:

<math>L(\Phi, \Theta) = \sum_{\gamma \in \Gamma} \gamma \sum_{d \in D} \sum_{w \in W_{\gamma}} n_{dw} \ln \sum_{t \in T} \varphi_{wt}\theta_{td} + \alpha R_{ss}(\Theta) + \beta R_{ss}(\Phi) + \tau R_{decor}(\Phi_{loc}) \to \max_{\Phi, \Theta}.</math> (2)

Here <math>\gamma \in \Gamma = \{\gamma_{def}, \gamma_{loc}\}</math> are the weights of the modalities, <math>\Phi</math> is a matrix of token probabilities for topics, and <math>\Theta</math> is a matrix of topic probabilities for documents. As in [37], we apply smooth-sparse regularizers to achieve smooth term distributions in topics and sparse topic distributions in messages:

<math>R_{ss}(\Phi) = \sum_{t \in T} KL(\beta_t \,\|\, \varphi_{wt}),</math> (3)

<math>R_{ss}(\Theta) = -\sum_{d \in D} KL(\alpha_d \,\|\, \theta_{td}),</math> (4)

where <math>\alpha_d</math> and <math>\beta_t</math> are sampled from some predefined distributions.

We apply the decorrelation regularizer only to the location modality in order to be able to detect similar events that happened in different places at the same time:

<math>R_{decor}(\Phi_{loc}) = -\sum_{t, s \in T} \sum_{w \in W_{loc}} \varphi_{wt}\varphi_{ws}.</math> (5)

We use the BigARTM library [39] to train the multimodal models. The result is a pair of matrices <math>\Phi</math> and <math>\Theta</math> for each time period. After that, “background” topics with high entropy of token distributions can be filtered out.

====4.2 Detection of Novel Events====

In the second step, we determine whether the extracted events were discussed before. We aggregate several adjacent periods of time into “time windows”. Suppose we have topics <math>s</math> and <math>t</math> in the same time window. Denote the vectors of token distributions for these topics as <math>\Phi_t</math> and <math>\Phi_s</math>. As in [14], we use the Jensen–Shannon divergence between the token probabilities of the topics to estimate topic similarity:

<math>JSD(\Phi_t \,\|\, \Phi_s) = \frac{1}{2}\left[KL(\Phi_t \,\|\, M) + KL(\Phi_s \,\|\, M)\right], \quad M = \frac{1}{2}(\Phi_t + \Phi_s).</math> (6)

A topic is denoted as a “new event” if there are no earlier similar topics in a predefined time window.

====4.3 Events Matching====

In the third step, we match messages related to the same event from different sources, which can be various types of social networks or mass media sites. In the experiments, we enriched messages from Twitter related to novel emergency events with public Facebook posts. For each novel event, we construct a search query as a combination of the default and location tokens with the highest weights. To crawl Facebook, we use the Ghost.py library (https://github.com/jeanphix/Ghost.py).

We filter the obtained posts (leaving only emergency-related messages) as described in Section 3.3 and extract named entities and locations from them. We infer the topic-probability matrix <math>\tilde{\Theta}</math> for the remaining posts using the pretrained model for the event. Then, we filter out all messages that are not topically similar to the event. Due to the use of multimodal models, information about locations is also taken into account when assessing the similarity of posts.

===5 Experiments===

====5.1 Detection of Emergency Related Messages====

Dataset and Pre-processing

For evaluation of the method for detection of emergency-related messages, we use the CrisisLexT6 dataset. The dataset consists of 60,000 tweets related to 6 major crisis situations. Emergency-related tweets are labeled as “on-topic” and the others are labeled as “off-topic”. The pre-processing procedure included elimination of special characters, as well as conversion of hashtags, emojis, and URLs into single tokens.

Hyperparameters

* Logistic regression. Regularization: L2 penalty. Tolerance: 0.0001. Inverse regularization strength: 1.0.
* Random Forest. Number of estimators: 1,000. No limits on the maximum number of features and tree depth. Split quality measure: Gini impurity. Min number of samples per split: 2. Min number of samples per leaf: 1.
* Gradient boosting. Maximum tree depth: 20. Number of leaves: 11. Learning rate: 0.05. Feature fraction: 0.9. Bagging fraction: 0.8. Min frequency: 5. Number of estimators: 4,000 with early stopping for 200.
* FCN. Size of hidden layer: 256. Dropout: 0.5. Number of epochs: 10. Loss: cross entropy. Optimization algorithm: Adam. Learning rate: 0.0001. Weight decay: 0. Batch size: 256.
* CNN. Kernel size: [3, 4, 5]. Number of filters: 512. Dropout: 0.5. Optimization algorithm: Adam. Learning rate: 0.0001. Loss: binary cross entropy. Batch size: 128. Vocabulary size: 10,001. Number of epochs: 10 with early stopping for 3 epochs.

Results and Discussion

We use 5-fold cross-validation for evaluation. Results are presented in Table 1.

{| class="wikitable"
|+ Table 1. Results of the models for emergency-related message detection (F1-score), %, by embedding features
! Model !! FstTrain !! FstWiki !! GloveCC !! GloveTwt !! W2V !! InferSent
|-
| LogReg || 87.4±8.4 || 82.5±9.2 || 88.6±5.3 || 85.1±6.9 || 88.9±6.7 || 89.4±4.9
|-
| Rnd For. || 86.9±9.5 || 82.3±11.1 || 87.4±7.4 || 83.9±10.5 || 87.4±8.9 || 89.4±4.9
|-
| GBDT || 91.7±0.1 || 89.8±0.1 || 93.0±0.1 || 89.8±0.2 || 92.0±0.2 || N/A
|-
| FCN || 90.9±0.3 || 89.8±0.1 || 92.2±0.3 || 88.0±0.2 || 91.2±0.3 || 90.8±0.2
|-
| CNN || 94.3±0.3 || 93.4±0.3 || 93.8±0.2 || 92.7±0.2 || 92.9±0.2 || N/A
|-
| CLSTM || 92.1±0.2 || 92.2±0.3 || 92.2±0.6 || 91.5±0.5 || 92.3±0.5 || N/A
|}

We discovered several insights into the problems of processing and analyzing crisis-specific and Twitter-specific lexicon:
* Sentence-level embeddings are better than averaged word vectors. Averaging the embeddings of all words in a tweet blurs the real meaning of the text. The InferSent embedding model, which is constructed using NLI data and BiLSTM encoders, treats a sentence as a single entity and performs a more general projection process. However, the higher dimensionality (required to make accurate projections) makes it harder to use with several classification algorithms.
* GloVe embeddings pretrained on the Common Crawl corpus show better results than Twitter-specific embeddings. Sentence-level embeddings, pretrained on non-specific natural language inference data, also show superior results. It seems reasonable that the crisis-related lexicon differs from the common Twitter lexicon and tends to be closer to the general lexicon. However, we should note that there is a lack of publicly available Twitter data for training. The GloVe Twitter corpus contains only 27 billion words, which is much less than the Common Crawl corpus size of 840 billion words.
* All neural network models have a lower standard deviation of F1-score than the other machine learning algorithms (except GBDT). Therefore, the quality of neural networks could be more stable on unseen data and less sensitive to the context.
* Our best classifier (CNN for text classification + fastText, trained on our dataset) outperforms the models presented in the related work [40, 41, 42].

====5.2 Novel Emergency Event Extraction====

Dataset and Pre-processing

We crawled 60k Twitter messages from April 1, 2018 to April 12, 2018 using the focused crawler presented in [11]. With the help of the CNN, we filtered out messages that are not related to emergency events, which reduced the number of tweets in the dataset to 5,200. The remaining tweets were analyzed with the natural language processing pipeline and with the event discovery method. After that, we also crawled Facebook posts for each extracted event. Using the developed method, we filtered out posts that were considered irrelevant to the events extracted from Twitter. After filtering, 1k Facebook posts remained.

Hyperparameters

In our experiments, we applied grid search to tune the weights of the regularizers for the topic models. The criterion for the search was a weighted sum of the model's perplexity, the sparsity of the model's matrices, and the model's pointwise mutual information.

Results and Discussion

Since the experiments were conducted on open data, we estimated only the precision of the models. The results are presented in Table 2. The experiment shows that the proposed approach outperforms the baseline LDA models. This confirms the importance of using information about locations in the framework. One can note the relatively low precision for event matching. We believe this is due to the substantial time lag between the message crawling and the event matching experiments. Thus, true event-related posts may be treated by Facebook's search as less relevant than others.

{| class="wikitable"
|+ Table 2. Results of the novel emergency event extraction method (Precision), %
! Step !! LDA (baseline) !! Multimodal model
|-
| All events || 63.3 || 93.3
|-
| Novel events || 71.4 || 80.0
|-
| Event matching || 60.0 || 67.0
|}

===6 Conclusion===

We considered several problems related to monitoring of social networks: detection of messages related to emergencies, extraction of novel events, and matching of events reflected in different text sources. For detection of emergency-related messages, we use a CNN and word embeddings. For extraction of novel events and matching them across different sources, we propose multimodal topic modelling enriched with spatial information, together with Jensen–Shannon divergence.

We investigated the performance of different algorithms and embeddings for emergency-related message detection on the CrisisLexT6 dataset and found that the best solution is given by a CNN with fastText embeddings. We also compared the proposed multimodal topic model with the LDA baseline. The experimental results are promising and show that the proposed framework could be useful for monitoring emergency events via messages in social media.

In future work, we are going to address the problem of locating emergency events and create visualization tools for presenting them on a geographic map.

Acknowledgments. The project is supported by the Russian Foundation for Basic Research, project numbers: 15-29-06082, 15-29-06045 “ofi_m”.

===References===

[1] Loulwah AlSumait, Daniel Barbará, and Carlotta Domeniconi. On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, pages 3–12, 2008.

[2] Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 2442–2452, 2016.

[3] Zahra Ashktorab, Christopher Brown, Manojit Nandi, and Aron Culotta. Tweedr: Mining Twitter to inform disaster response. Proceedings of ISCRAM, pages 354–358, 2014.

[4] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit, 2009.

[5] David M Blei and John D Lafferty. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, pages 113–120, 2006.

[6] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, pages 993–1022, 2003.

[7] Mario Cataldi, Luigi Di Caro, and Claudio Schifanella. Emerging topic detection on Twitter based on temporal and social terms evaluation. In Proceedings of the tenth international workshop on multimedia data mining, 2010.

[8] Yan Chen, Hadi Amiri, Zhoujun Li, and Tat-Seng Chua. Emerging topic detection for organizations from microblogs. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pages 43–52, 2013.

[9] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, 2017.

[10] D. Deviatkin and A. Shelmanov. Towards text processing system for emergency event detection in the Arctic zone. In Proceedings of Data Analytics and Management in Data Intensive Domains, pages 225–232, 2016.

[11] D. Devyatkin and A. Shelmanov. Text processing framework for emergency event detection in the Arctic zone. Communications in Computer and Information Science, pages 74–88, 2017.

[12] Qiming Diao, Jing Jiang, Feida Zhu, and Ee-Peng Lim. Finding bursty topics from microblogs. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 536–544, 2012.

[13] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57, 1999.

[14] Jiajia Huang, Min Peng, Hua Wang, Jinli Cao, Wang Gao, and Xiuzhen Zhang. A probabilistic method for emerging topic tracking in microblog stream. World Wide Web, pages 325–350, 2017.

[15] Muhammad Imran, Carlos Castillo, Ji Lucas, Patrick Meier, and Sarah Vieweg. AIDR: Artificial intelligence for disaster response. In Proceedings of the companion publication of the 23rd International Conference on World Wide Web Companion, pages 159–162, 2014.

[16] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 427–431, 2017.

[17] Shiva Prasad Kasiviswanathan, Prem Melville, Arindam Banerjee, and Vikas Sindhwani. Emerging topic detection using dictionary learning. In Proceedings of the 20th ACM international conference on Information and knowledge management, pages 745–754, 2011.

[18] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3149–3157, 2017.

[19] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, 2014.

[20] Chenliang Li, Aixin Sun, and Anwitaman Datta. Twevent: segment-based event detection from tweets. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 155–164, 2012.

[21] Alan M. MacEachren, Anuj Jaiswal, Anthony C. Robinson, Scott Pezanowski, Alexander Savelyev, Prasenjit Mitra, Xiao Zhang, and Justine Blanford. SensePlace2: GeoTwitter analytics support for situational awareness. In Proceedings of Visual Analytics Science and Technology (VAST) on IEEE Conference, pages 181–190, 2011.

[22] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[23] Alexandra Olteanu, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. CrisisLex: A lexicon for collecting and filtering microblogged communications in crises. In Proceedings of ICWSM, 2014.

[24] Alexandra Olteanu, Sarah Vieweg, and Carlos Castillo. What to expect when the unexpected happens: Social media communications across crises. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, pages 994–1009, 2015.

[25] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

[26] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.

[27] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes Twitter users: real-time event detection by social sensors.

[32] Jianshu Weng and Bu-Sung Lee. Event detection in Twitter. ICWSM, pages 401–408, 2011.

[33] Wei Xie, Feida Zhu, Jing Jiang, Ee-Peng Lim, and Ke Wang. Topicsketch: Real-time bursty topic detection from Twitter. IEEE Transactions on Knowledge and Data Engineering, pages 2216–2229, 2016.

[34] Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng. A probabilistic model for bursty topic discovery in microblogs. In AAAI, pages 353–359, 2015.

[35] Jie Yin, Sarvnaz Karimi, Bella Robinson, and Mark Cameron. ESA: emergency situation awareness via microbloggers. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 2701–2703, 2012.

[36] Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis Lau. A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630, 2015.

[37] Anastasia Ianina, Lev Golitsyn, and Konstantin Vorontsov. Multi-objective topic modeling for exploratory search in tech news. Conference on Artificial Intelligence and Natural Language, pages 181–193, 2017.

[38] Konstantin Vorontsov and Anna Potapenko. Additive regularization of topic models. Machine Learning, pages 303–323, 2015.

[39] Konstantin Vorontsov et al. BigARTM: Open
In source library for regularized multimodal topic Proceedings of the 19th international modeling of large collections. International conference on World Wide Web, pages 851– Conference on Analysis of Images, Social 860, 2010. Networks and Texts, pages 370–381, 2015. [28] Erich Schubert, Michael Weiler, and Hans-Peter [40] Roy Chowdhury S, Purohit H, Imran M. D- Kriegel. Signitrend: scalable detection of sieve: a novel data processing engine for emerging topics in textual streams by hashed efficient handling of crises-related social significance thresholds. In Proceedings of the messages. InProceedings of the 24th 20th ACM SIGKDD international conference International Conference on World Wide Web, on Knowledge discovery and data mining, pages 1227–1232, 2015. pages 871–880, 2014. [41] Zhang S, Vucetic S. Semi-supervised discovery [29] Sayan Unankard, Xue Li, and Mohamed A of informative tweets during the emerging Sharaf. Emerging event detection in social disasters. arXiv preprint arXiv:1610.03750. networks with location sensitivity. World Wide 2016. Web, pages 1393–1417, 2015. [42] Li H, Caragea D, Caragea C, Herndon N. [30] Xuerui Wang and Andrew McCallum. Topics Disaster response aided by tweet classification over time: a non-markov continuous-time with a domain adaptation approach. Journal of model of topical trends. In Proceedings of the Contingencies and Crisis Management, pages 12th ACM SIGKDD international conference 16-27, 2018. on Knowledge discovery and data mining, pages 424–433, 2006. [31] Yu Wang, Eugene Agichtein, and Michele Benzi. Tm-LDA: efficient online modeling of latent topic transitions in social media. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 123–131, 2012. 215