                  Towards Text Processing System
           for Emergency Event Detection in the Arctic Zone
                 © Dmitriy Deviatkin                                                 © Artem Shelmanov
                      Federal Research Center “Computer Science and Control” of
                                    Russian Academy of Sciences,
                                          Moscow, Russia
                     devyatkin@isa.ru                         shelmanov@isa.ru



Abstract

We present ongoing work on a text processing system for the detection and analysis of events related to emergencies in the Arctic zone. The peculiarity of the task consists in data sparseness and the scarcity of tools and language resources for processing such specific texts. The system performs focused crawling of documents related to emergencies in the Arctic region, text parsing including named entity recognition and geotagging, and indexing of texts with their metadata for faceted search. The system aims at processing both English and Russian text messages and documents. We report preliminary results of an experimental evaluation of the system components on Twitter data.

Keywords: focused crawling, event detection, monitoring, named entity recognition, text processing, information search

Proceedings of the XVIII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL’2016), Ershovo, Russia, October 11 - 14, 2016

1 Introduction

Due to the ever-growing amount of data available on the web, monitoring and searching in textual streams remains one of the most urgent problems today; it has inspired researchers to develop many general-purpose information-retrieval methods and systems. However, the development of applications for specific domains often reveals a lack of suitable techniques that could address the challenging tasks arising in these domains, which requires significant research.

This paper describes the ongoing development of a search and monitoring system for a specific domain and task: the detection and analysis of emergency events in the Arctic zone. Since a great deal of textual information is generated during emergencies and crises, as during major events of other types, it is crucial to have automated tools for filtering and processing unstructured textual data to support search and rescue operations and to help people in affected areas. The Arctic zone is a harsh but important and promising region with a lot of potential for development. A remarkable peculiarity of the chosen domain is data sparseness and the scarcity of tools and language resources for processing such specific data, which poses a difficult problem.

The most significant features of the system are focused crawling and faceted search.

Since it is impossible to store all the data available on the web, the developed system is designed to accumulate only data related to emergencies in the Arctic zone from multiple textual streams. The sources of such information include, but are not limited to, mass media, social networks, and reports (e.g., official sources like national transportation safety boards1,2). The focused crawler is intended to narrow down the amount of indexed text and to extract basic metadata of downloaded documents. At first sight, the problem of crawling messages about emergency events is very similar to topic crawling. The key difference lies in the fact that emergency-related messages can be devoted to multiple topics, and the composition of these topics can change over time. This means that ordinary topical approaches lead to inadequate accuracy and a laborious crawling process. To mitigate this problem, we have implemented the following ideas in the proposed framework:

- Multiple topic crawlers with narrow focuses outperform a single data collecting process in terms of recall.
- Extracting geographical coordinates and considering them for further filtering improves the accuracy of the crawling process. One could obtain topically irrelevant but important messages from an emergency zone.
- Topic models for crawled texts can be periodically built and verified for better tracking of topic shifts in text streams.
- Reposts and fuzzy duplicates can be effectively detected via inverted full-text indices [28].

1 http://www.tsb.gc.ca/eng/rapports-reports/marine/index.asp
2 http://www.ntsb.gov/investigations/AccidentReports/Pages/marine.aspx

The faceted search provides the ability to retrieve and analyze texts from different perspectives: topic, time,
location, relations with a given object, etc. The developed system performs deep natural language processing of texts (including syntax parsing and semantic role labeling), named entity recognition, and geotagging. The extracted metadata is indexed for the faceted search.

We evaluated the developed subsystems for geotagging, crawling, and faceted search on data acquired from Twitter. Although this social network accumulates only short messages and is not designed to provide data for the considered tasks, many researchers, as shown in Section 2, have demonstrated that tweets can be a useful source of information about emergencies. When common communication services are down, Twitter provides a channel that is used by affected people and emergency response teams [22]. Therefore, we used messages crawled from Twitter for preliminary experiments, for testing our approaches, and for the evaluation of the system components. However, we note that the developed system is designed to handle all sorts of textual information, not just short messages.

The rest of the paper is organized as follows. Section 2 reviews related work on monitoring emergency events with the help of social networks and on focused crawling. Section 3 describes the details of the system under development; it presents the natural language processing pipeline, the method for focused crawling, and the faceted search techniques. In Section 4, the results of the preliminary experiments are presented and discussed. Section 5 concludes and outlines future work.

2 Related work

The problem of event detection in text streams has attracted a lot of attention from the research community. Methods developed to address this problem have been applied to many domains. One of them is the monitoring of emergencies. It was noticed that mass emergencies initiate an intensive exchange of information in social networks. This immense text stream contains cues about the situation in an affected area, infrastructure damage, human casualties, and requests and proposals for help. This is crucial information that can enhance the situation awareness [18] of both affected people and participants of rescue operations. However, it is mixed up with heavy noise: irrelevant or useless messages. Therefore, to put it to good use, new methods and technologies are required. The need for such technologies became apparent, which facilitated the development of many diverse systems for mining emergency-related information in social networks. We review the most significant recent work on such systems.

Papers [20] and [17] present Twitris, an information flow monitoring system designed for processing short messages from mass and social media, as well as SMS messages. Researchers tested the system on Twitter data. The system crawls messages from Twitter using a set of keywords, which is expanded over time with the most significant n-grams extracted from acquired messages. The system extracts spatial and temporal information, as well as topics, which are used for message clustering. The clusters are considered as events found in an information stream. Researchers tested the system on data acquired during hurricane Sandy. They showed that the system could be used for finding messages from affected people considering their location.

Another monitoring system, SensePlace2, described in [12], specializes in the analysis of geographical data extracted from tweets. The system aims at improving situational awareness during search and rescue operations. The main goal of the system is text stream filtering and the search for messages related to a given topic, place, and time. The system utilizes geographical tags, as well as information extracted from message texts. Besides text, SensePlace2 also indexes geographical and temporal information of messages. This enables the system to filter a message stream by place and time and to build analytical reports for topic-time-location data. SensePlace2 can visualize results in different ways: as a common search result list, on a time scale as a histogram, and on a heat map, which displays the intensity of messages about a particular topic near a given location. Researchers tested the system on data related to the Haiti earthquake. They showed that SensePlace2 could be useful for finding refugee streams that are not represented in official sources.

In [24], researchers present a method for the classification of messages acquired from a message stream. They demonstrate its capability of finding useful emergency-related messages on Twitter data. The method classifies messages as useful or non-useful via standard supervised machine learning methods (Naïve Bayes and Maximum Entropy). The most remarkable aspect is the feature set used for training. Besides low-level features, the authors also conducted experiments with high-level features such as message objectivity, and whether a message is personal or impersonal, formal or informal. The authors show that high-level features substantially improve the quality of classification. The out-of-domain evaluation showed accuracy from 30 to 80%. The experiments were conducted on data acquired during the Haiti earthquakes, USA wildfires, and floods.

The system EMERSE (Enhanced Messaging for the Emergency Response Sector) [4] collects messages from different sources, translates them, and classifies them into topics for better search and filtering. EMERSE consists of a smartphone application, a Twitter crawler, a translation subsystem, and a classification subsystem. The smartphone application is intended to simplify the process of collecting messages and their metadata, such as location, time, and associated media files (photos, videos). Besides, the system crawls Twitter considering timestamps and eliminating duplicates (reposts). EMERSE classifies messages into multiple classes using a support vector machine. In [4], the authors experimented with different features and feature selection methods: bag of words, feature abstraction methods [21], Latent Dirichlet Allocation (LDA), and others. The system was tested on a collection of messages submitted to the




Ushahidi3 web service during the Haiti earthquake. In this example, the authors demonstrate that EMERSE can improve the coordination of people during emergencies.

In [25, 26], the ESA (Emergency Situation Awareness) system is presented. It can monitor social networks and blogs in real time and visualize information about different emergencies. The main task of the system is to enhance the situational awareness of people in an affected area. The system is oriented on the New Zealand and Australia regions. ESA gathers tweets and detects topical bursts in information streams. Retrospective data is used for building a language model, which is then applied for burst detection. The algorithm searches for lexis whose distribution diverges strongly from the language model. For a convenient representation of bursts to end users, ESA performs thematic clustering of messages. The system also selects informative messages that signal emergencies, destruction, and requests for help. ESA has a component that extracts relevant spatial data using explicit geotags of messages (GPS coordinates received from a smartphone) and implicit information found in user profiles. The conversion from geographical names to coordinates is performed by the Yahoo geo-service4 (retired today). ESA also performs named entity recognition: it extracts names of organizations, names of people, geographical entities, dates, and timestamps. All these data can be visualized on a map, which could be useful for providing a better representation of found events to end users. Visualization of data in ESA is also enhanced with media files (images, videos) extracted from messages. The authors tested ESA in an Australian crisis center, which is responsible for monitoring natural disasters and other national security threats.

AIDR5 (Artificial Intelligence for Disaster Response) is an open-source platform for the classification of messages related to emergencies [9]. The system detects messages about different topics: infrastructure damage, casualties, required or available donations. The authors point out that classifiers trained on data collected during one disaster perform badly on data acquired from new disasters. They address this problem by introducing human annotation into the process of adapting the system to new tasks. When a new emergency happens, the system should be retrained. The training dataset for supervised machine learning is composed from the old labeled data and data urgently annotated via crowdsourcing services. The system has elements of active learning: it chooses for human annotation the most informative samples that can significantly leverage classification performance. The authors tested the system on a collection of messages related to the Pakistan earthquake in 2013.

TEDAS [10] is a system for emergency detection via focused crawling of Twitter messages. TEDAS collects topic-relevant messages using the Twitter search API. The system uses an original crawling strategy that consists in dynamically shifting the crawler focus.

Another system for the vertical search of information about emergencies is described in [27]. The system includes a focused ontology-based crawler; an extensive ontology describing various emergencies was designed for it.

It is also worth mentioning Tweedr [2], an open-source system that can find informative messages on Twitter for the information support of people involved in rescue operations. It can distinguish general messages from ones that carry particular information about infrastructure damage and human casualties. Another recent effort in constructing a tweet classification system is described in [5]. The authors use deep natural language processing techniques and a rich set of features to determine whether a message contains information about damage dealt during natural disasters. In [14], an approach for the construction of crisis-related term lists is proposed. The authors used pseudo-relevance feedback mechanisms to expand a set of seed terms during crawling, which improves the recall of retrieving messages related to mass emergencies. Another lexicon, called EMTerms, is described in [23]. The authors claim that it is the biggest crisis-related lexicon for Twitter analysis so far.

Solutions for monitoring events in text streams heavily depend on focused crawling techniques. We review some of the state-of-the-art approaches below.

The iCrawl system [7] is a framework for the focused crawling of social networks. It adopts an ontology-based crawling strategy. The novel feature of this system is the usage of Internet search engines for the generation of bootstrap crawling points. In [3], researchers propose a distributed crawler for continuous message gathering from particular user communities, which can circumvent the limits of the Twitter API. In [11], an automatic Topic-focused Monitor is presented. It samples tweets from the message stream and selects keywords to track target topics based on the samples.

The review shows that there are plenty of systems for monitoring emergency-related events in textual streams intended to improve the situational awareness of affected people and rescue teams. In our work, we consider a particular geographical region, the Arctic zone, which complicates the focused crawling and filtering of data. Many of the aforementioned systems specialize in narrow problems like message classification, whereas our research is oriented on the development of a full-stack system that solves many tasks: from focused crawling and information extraction to faceted search leveraged with spatial and temporal metadata. Unlike the aforementioned systems, the framework proposed in this paper is oriented on processing messages in both English and Russian. This is significant because of the large area of the Arctic territories of Russia. We note that many systems use Twitter data for evaluation, and we also use this approach in our work.

3 https://www.ushahidi.com/
4 https://developer.yahoo.com/boss/geo/
5 http://aidr.qcri.org/
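To make the preceding contrast concrete, the spatial-plus-lexical filtering that distinguishes region-oriented crawling from generic topical crawling can be sketched as follows. This is a minimal illustration, not the system's actual code: the message schema and the toy crisis lexicon are hypothetical stand-ins, while the 60th-parallel threshold mirrors the geo-tag filtering described later in Section 3.2.

```python
# Sketch of geography-plus-lexicon message filtering (illustrative only):
# a message is kept if its geo-tag lies north of the 60th parallel or if
# its text mentions crisis-related lexis. The message schema and the toy
# lexicon stand in for real geo-tags and a CrisisLex-style term list.

ARCTIC_MIN_LAT = 60.0                        # 60th-parallel threshold
CRISIS_TERMS = {"rescue", "icebreaker", "evacuation", "collision"}  # toy set

def is_candidate(message: dict) -> bool:
    """Decide whether a crawled message should be kept for indexing."""
    geo = message.get("geo")
    if geo and geo.get("lat", -90.0) >= ARCTIC_MIN_LAT:
        return True                          # geo-tagged in the Arctic zone
    tokens = {t.strip(".,!?;:").lower() for t in message.get("text", "").split()}
    return bool(tokens & CRISIS_TERMS)       # mentions crisis lexis

messages = [
    {"geo": {"lat": 69.3}, "text": "Quiet evening in Murmansk"},
    {"text": "Icebreaker crew rescue completed"},
    {"geo": {"lat": 35.7}, "text": "Sunny day at the beach"},
]
kept = [m for m in messages if is_candidate(m)]
```

Here the first message passes on the spatial test and the second on the lexical one; only the third is discarded. A real pipeline would additionally rely on gazetteer-based toponym matching, NER, and duplicate elimination before indexing.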




Figure 1 Framework for crawling of emergency messages

3 System components

3.1 Natural language processing pipeline

The system performs deep natural language processing of Russian and English texts. Besides basic processing tools, the pipeline includes syntax parsing, semantic role labelling, and named entity recognition.

The basic analysis of Russian texts is performed by AOT.ru6. This framework is used for tokenization, sentence boundary detection, POS tagging, and lemmatization, including morphological disambiguation. We use MaltParser7 trained on SynTagRus [13] for the dependency parsing of Russian texts and our semantic parser for semantic role labelling [19]. The same types of linguistic analysis for English texts are performed via FreeLing [16]. Note that the syntax and semantic annotations are used for information search (see Section 3.3).

For basic named entity recognition, we use the Polyglot NER framework [1]. It implements a language-agnostic approach and therefore provides named entity recognition for many languages, including English and Russian. It produces annotations for locations, organizations, and person names. However, we found that this basic NER processor is not suitable for extracting toponyms related to a particular region (e.g., the Arctic zone); it yields low recall in this task. Therefore, we complemented Polyglot with a gazetteer.

The gazetteer was created on the basis of the Geonames8 database, which contains more than 11 million geographical locations of different types around the world with their names (in many languages, including Russian and English), geographical coordinates, and other metadata. From Geonames, we extracted the names of locations situated north of the 60th latitude. The gazetteer uses these data to mark spatial information in texts. It also implements rather simple rules, taking into account parts of speech and the capitalization of words, to filter out common false positives.

We also tag crisis-related lexis in texts; this enhances and simplifies filtering and search. The data for this purpose is taken from the CrisisLex lexicon proposed in [14].

6 http://aot.ru/
7 http://maltparser.org/
8 http://www.geonames.org/

3.2 Focused crawling framework

We deal with several social networks, such as Twitter, Facebook, and VKontakte, and with some news feeds (ArcticInfo, BarentsObserver, BBC, etc.). These sources provide different kinds of content. Twitter provides an API for crawling recent messages by keywords. However, the limitations of the API make the topical crawling process challenging. Since the results commonly contain much irrelevant noise, additional filtering is necessary. We access Facebook and VKontakte primarily via links in Twitter messages that are considered topically relevant. The news feeds have a static structure; therefore, they can be processed by a common crawler with a preliminarily created static task. The data acquired from news feeds do not need topical filtering, because the crawling task can be restricted to processing only the relevant sections. Since we deal with a number of heterogeneous sources, we use several kinds of crawlers (see Fig. 1).

The first type is the GeoTag crawler. It collects messages from Twitter with specified coordinates. Tweets may include geographical coordinates or geo-tags, which can be used for the localization of their authors. We filter out all messages whose geo-tag latitude is less than 60 degrees.

The second type is the Topic crawler. These crawlers download topically relevant messages from Twitter with unspecified coordinates. Each topic crawler has lists of “permissive” and “restrictive” terms that are fed to




the Twitter search API. In the initial steps, several bootstrap terms are used to define a target topic. The challenge lies in the limitations of the topic search API provided by Twitter: it restricts the size of a query and of a response, which leads to insufficient recall of the crawling process. The simplicity of the query language causes low precision and recall of the collected data. We use multiple topic crawlers with different keyword subsets to solve the insufficient recall problem. NER and filtering are used to improve the precision.

The last type of crawler is the common crawler. It collects data from topically related sections of news feeds. Crawlers of this type can also download pages from VKontakte and Facebook referenced by relevant Twitter posts.

The whole schema of data processing in our framework is the following. In the first step, messages are collected by the GeoTag and Topic crawlers. In the second step, we apply the linguistic analyser, NER, and gazetteer to the collected texts. Then, we filter out all messages that contain no crisis lexis, toponyms, or geotags. URLs from the remaining messages are fed to the common crawler, which also processes topically related news feeds. The selected useful messages and documents are indexed by the Exactus search engine [15].

For the Topic crawlers, we build a topic model [8] of the crawled messages every several days. It helps to track topic shifts in the message stream. We summarize topic content with a keyword cloud and a set of the most significant messages from the cluster. Then each topic is marked as relevant or irrelevant by several assessors (see Table 1). We define the following types of posts as relevant:
   1. Posts about arbitrary events (past, current, and planned) and locations in the Arctic.
   2. Arbitrary posts from users who are currently in the Arctic zone.
The most significant terms from the relevant topics are sent to the “permissive” keyword collections of topic crawlers, and terms from irrelevant topics are sent to the “restrictive” ones. Thus, the crawling process becomes responsive to trend shifts.

Table 1 Examples of topics for crawled data
 No  Keywords                                                                         Relevant
 1   Bay, charity, Amazon, Antarctica, cdnpoli.                                       False
 2   Starling, Tuktoyaktuk, community, visit, bird, southern, blackbird.              True
 3   Ice, national, ship, circle, arctics, photography, day, june, pewenvironment.    True
 4   Rescue, buntings, air, guardsmen, squadron, cranes, divers, spot.                True
 5   Haha, dart, Trump, meepismurder, white, sales, gauges, street.                   False
 6   Icebreaker, Nunavut, hardy, apithanny, piece, fascinating, blue, warming, bear.  True
 7   Home, conservation, thebigbidtheory, may, island, science, hydrazine.            False
 8   Spring, noaa, climatechange, water, super, sail, challenge, Mediterranean.       False
 9   Arctic, Alaska, skuas, Greenland, road, amb, melt, Anchorage, Bering.            True
 10  Life, natgeomag, trip, journey, remote, team, chukchi, collaborating.            True

3.3 Faceted search

The faceted search has become a backbone of professional search applications [6]. In this type of search, users can iteratively specify queries using metadata and keywords extracted from the search results of previous iterations. Additionally, search results can be filtered using different sets of meta fields, which can be static or dynamic.

In the developed system, the faceted search is powered by the Exactus technology [15]. Its main advantage lies in its ability to efficiently index rich linguistic information, including syntax relations, semantic roles, and other types of semantic annotations extracted from natural language text (e.g., named entities). This enables phrase search (results have to contain the given syntactically connected phrases) and semantic search (results are ranked taking into account the semantic similarity of the query and the indexed documents). We take advantage of this technology by introducing indexing by geographical tags, timestamps, and emergency-related tags. This provides the ability to filter results efficiently by semantic information like location, time, organizations, persons, and topics. It also provides the ability to retrieve information with certain tags filtered by other metadata, producing results that can be sifted with subsequent queries.

4 Evaluation of system components

We have conducted a series of experiments to assess the quality of the created components for focused crawling, named entity recognition, and faceted search. The source of the data for evaluation is the Twitter social network. The experimental dataset contains approximately 100 thousand messages in English and Russian. In the first experiment, we assessed the accuracy of the proposed focused crawling framework; more specifically, we evaluated the quality of filtering. We labelled several subsets of posts devoted to accidents in Alaska and the Bering Sea. Each post from the subsets was labelled by three assessors to reach sufficient coherence of the test data. We did not apply a cross-validation approach here because the labelling was not used for crawler training, only for testing. The standard measures for supervised learning (precision, recall, and F1-score) were computed for each subset, and macro-averaging was used to aggregate the results. Table 2 reports the results of crawling without and with filtering as “Impure data” and “Filtered data”, respectively.

Applying the proposed filtering technique results in substantial growth of the precision without the

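The message-filtering step of the pipeline (dropping messages that contain no crisis lexis, toponyms, or geotags) can be sketched as follows. The lexicon, toponym list, and message fields here are hypothetical placeholders, not the system's actual resources:

```python
# Hypothetical sketch of the message filter: a message is kept only if it
# contains crisis vocabulary, a known toponym, or a geotag.
# CRISIS_LEXICON and ARCTIC_TOPONYMS are placeholder term sets.
CRISIS_LEXICON = {"fire", "flood", "rescue", "collision", "distress"}
ARCTIC_TOPONYMS = {"alaska", "murmansk", "svalbard", "bering"}

def is_relevant(message: dict) -> bool:
    tokens = set(message["text"].lower().split())
    has_crisis = bool(tokens & CRISIS_LEXICON)
    has_toponym = bool(tokens & ARCTIC_TOPONYMS)
    has_geotag = message.get("geo") is not None
    return has_crisis or has_toponym or has_geotag

messages = [
    {"text": "Fishing vessel in distress near Bering Strait", "geo": None},
    {"text": "Nice weather today", "geo": None},
]
# Only the first message passes: it matches both crisis lexis and a toponym.
kept = [m for m in messages if is_relevant(m)]
```

In the real pipeline the token matching is performed by the linguistic analyser, NER, and gazetteer rather than by naive whitespace splitting.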

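The iterative tag-based narrowing that faceted search provides can be illustrated with a minimal sketch; the document schema and tag names below are invented for illustration and do not reflect the Exactus implementation:

```python
# Minimal illustration of faceted narrowing: each indexed document carries
# tags, and consecutive queries sift the previous result set.
docs = [
    {"id": 1, "tags": {"location": "Alaska", "crisis": "flood"}, "lang": "en"},
    {"id": 2, "tags": {"location": "Alaska", "crisis": "fire"}, "lang": "en"},
    {"id": 3, "tags": {"location": "Murmansk", "crisis": "flood"}, "lang": "ru"},
]

def facet(results, tag, value):
    """Keep only results whose tag matches the requested facet value."""
    return [d for d in results if d["tags"].get(tag) == value]

# Consecutive queries: first narrow by location, then sift by crisis type.
step1 = facet(docs, "location", "Alaska")  # documents 1 and 2
step2 = facet(step1, "crisis", "flood")    # document 1
```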

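The macro-averaged precision, recall, and F1-score used in the crawling evaluation can be computed as follows; the per-subset counts are illustrative, not the experiment's actual data:

```python
# Macro-averaging sketch: per-subset precision/recall/F1 are computed from
# labelled counts, then averaged with equal weight per subset.
def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

subsets = [(40, 10, 5), (30, 30, 0)]  # illustrative (tp, fp, fn) per subset
scores = [prf(*s) for s in subsets]
macro = tuple(sum(x) / len(scores) for x in zip(*scores))  # (P, R, F1)
```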

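The NDCG scores reported for the faceted-search evaluation admit the following common formulation (graded relevance discounted by the log of the rank, normalized by the ideal ordering); the relevance grades below are illustrative only:

```python
from math import log2

def ndcg(rels, k):
    """NDCG@k for a ranked list of graded relevance judgements."""
    dcg = sum(r / log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg else 0.0
```

A perfectly ordered ranking scores 1.0; misplacing relevant results lowers the score.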
Filtering thus ensures that during the crawling process we do not lose much topically relevant data while substantially decreasing the stored noise. We decided to choose fairly soft filtering because, although a stricter procedure would improve precision, it would also imply a more significant drop in recall, which contradicts the purpose of the monitoring system.

Table 2 Focused crawling evaluation

                    P      R      F1
    Impure data    0.26   1.00   0.41
    Filtered data  0.57   0.94   0.70

    In the second experiment, we estimated the performance of named entity recognition carried out by Polyglot and the gazetteer. We labelled all location mentions in 300 tweets downloaded by the Topic crawler and measured precision, recall, and F1-score for the extraction of spatial entities (Table 3).

Table 3 NER evaluation (on locations)

                           P      R      F1
    Polyglot              0.78   0.57   0.66
    Gazetteer             0.78   0.74   0.76
    Polyglot + gazetteer  0.76   0.82   0.79

    The results show that the proposed gazetteer significantly outperforms Polyglot on location extraction in terms of recall. The knowledge source of Polyglot is Wikipedia, which does not have full coverage of locations. We conclude that it is reasonable to use the gazetteer and Polyglot together for maximum performance.
    In the last experiment, we assessed the performance gain in information search achieved by the proposed emergency faceted search method in comparison with a baseline algorithm. As the baseline, we deployed the Exactus full-text search algorithm without filtering by location tags. For the evaluation, we applied the NDCG score and a peer-reviewing approach. The results are presented in Table 4.

Table 4 Faceted search evaluation

                NDCG@3   NDCG@5   NDCG@10
    Faceted      0.76     0.76     0.70
    Baseline     0.61     0.55     0.53

    The results reveal that using location and crisis tags for faceted search significantly improves the quality of ranking when searching for posts about emergencies.

5 Conclusion

    We presented an automated framework for crawling and processing textual documents about emergency events in the Arctic zone. The main functions of the proposed framework are focused crawling and faceted search that take into account information about geographical locations and timestamps of messages. With the data crawled from Twitter, we experimentally demonstrated that the framework provides the basic abilities for the analysis of message streams about emergencies in a restricted area.
    In future work, we are going to incorporate into the natural language processing pipeline components that extract information about ships and planes in the Arctic zone. Much of this information is openly available on the web (e.g., the MarineTraffic service, http://www.marinetraffic.com/). Tagging ship names and their coordinates in document and message streams can potentially improve the quality of emergency event detection and enhance situation awareness.
    We are going to accumulate more retrospective data from social networks and other sources to increase the recall of the crawling process. Among many other types of information sources, collections of reports from rescue services are the most promising supplement for crawling. Another way to improve topic crawling is the detection of users and groups in social networks that constantly post topically relevant messages. This could be done semi-automatically by building topic models of users and groups. We are also going to create visualization tools for geotagged messages that can present events on a map.

Acknowledgments

    The project is supported by the Russian Foundation for Basic Research, project number 15-29-06045 "ofi_m".

References

[1] Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. Polyglot-NER: Massive multilingual named entity recognition. In Proceedings of the 2015 SIAM International Conference on Data Mining. SIAM, 2015.
[2] Zahra Ashktorab, Christopher Brown, Manojit Nandi, and Aron Culotta. Tweedr: Mining Twitter to inform disaster response. In Proceedings of ISCRAM, pages 354–358, 2014.
[3] Matko Bošnjak, Eduardo Oliveira, José Martins, Eduarda Mendes Rodrigues, and Luís Sarmento. TwitterEcho: A distributed focused crawler to support open research with Twitter data. In Proceedings of the 21st International Conference Companion on World Wide Web, pages 1233–1240. ACM, 2012.
[4] Cornelia Caragea, Nathan McNeese, Anuj Jaiswal, Greg Traylor, Hyun-Woo Kim, Prasenjit Mitra, Dinghao Wu, Andrea H. Tapia, Lee Giles, Bernard J. Jansen, et al. Classifying text messages for the Haiti earthquake. In Proceedings of ISCRAM, 2011.
[5] Stefano Cresci, Maurizio Tesconi, Andrea Cimino, and Felice Dell'Orletta. A linguistically-driven approach to cross-event damage assessment of
    natural disasters from social media messages. In Proceedings of the 24th International Conference on World Wide Web Companion, pages 1195–1200. International World Wide Web Conferences Steering Committee, 2015.
[6] Pavlos Fafalios and Yannis Tzitzikas. Exploratory professional search through semantic post-analysis of search results. In Professional Search in the Modern World, pages 166–192. Springer, 2014.
[7] Gerhard Gossen, Elena Demidova, and Thomas Risse. The iCrawl Wizard – supporting interactive focused crawl specification. In Advances in Information Retrieval, pages 797–800. Springer, 2015.
[8] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57. ACM, 1999.
[9] Muhammad Imran, Carlos Castillo, Ji Lucas, Patrick Meier, and Sarah Vieweg. AIDR: Artificial intelligence for disaster response. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web, pages 159–162, 2014.
[10] Rui Li, Kin Hou Lei, Ravi Khadiwala, and Kevin Chen-Chuan Chang. TEDAS: A Twitter-based event detection and analysis system. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering (ICDE), pages 1273–1276. IEEE, 2012.
[11] Rui Li, Shengjie Wang, and Kevin Chen-Chuan Chang. Towards social data platform: Automatic topic-focused monitor for Twitter stream. Proceedings of the VLDB Endowment, 6(14):1966–1977, 2013.
[12] Alan M. MacEachren, Anuj Jaiswal, Anthony C. Robinson, Scott Pezanowski, Alexander Savelyev, Prasenjit Mitra, Xiao Zhang, and Justine Blanford. SensePlace2: GeoTwitter analytics support for situational awareness. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology (VAST), pages 181–190, 2011.
[13] Joakim Nivre, Igor M. Boguslavsky, and Leonid L. Iomdin. Parsing the SynTagRus treebank of Russian. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 641–648, 2008.
[14] Alexandra Olteanu, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. CrisisLex: A lexicon for collecting and filtering microblogged communications in crises. In Proceedings of ICWSM, 2014.
[15] Gennady Osipov, Ivan Smirnov, Ilya Tikhomirov, Ilya Sochenkov, and Artem Shelmanov. Exactus Expert – search and analytical engine for research and development support. In Novel Applications of Intelligent Systems, pages 269–285. Springer, 2016.
[16] Lluís Padró and Evgeny Stanilovsky. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012). ELRA, 2012.
[17] Hemant Purohit and Amit P. Sheth. Twitris v3: From citizen sensing to analysis, coordination and action. In Proceedings of ICWSM, pages 746–747, 2013.
[18] Nadine B. Sarter and David D. Woods. Situation awareness: A critical but ill-defined phenomenon. The International Journal of Aviation Psychology, 1(1):45–57, 1991.
[19] A. O. Shelmanov and I. V. Smirnov. Methods for semantic role labeling of Russian texts. In Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference "Dialogue" (2014), number 13, pages 607–620, 2014.
[20] Amit P. Sheth, Hemant Purohit, Ashutosh Sopan Jadhav, Pavan Kapanipathi, and Lu Chen. Understanding events through analysis of social media. Kno.e.sis Center, Wright State University, Tech. Rep., 2010.
[21] Adrian Silvescu, Cornelia Caragea, and Vasant Honavar. Combining super-structuring and abstraction on sequence classification. In Proceedings of ICDM, pages 986–991. IEEE, 2009.
[22] Juan Sixto, Oscar Peña, Bernhard Klein, and Diego López-de-Ipiña. Enable tweet-geolocation and don't drive ERTs crazy! Improving situational awareness using Twitter. In Proceedings of SMERST, pages 27–31, 2013.
[23] Irina Temnikova, Carlos Castillo, and Sarah Vieweg. EMTerms 1.0: A terminological resource for crisis tweets. In Proceedings of the 12th International Conference on Information Systems for Crisis Response and Management (ISCRAM 2015), 2015.
[24] Sudha Verma, Sarah Vieweg, William J. Corvey, Leysia Palen, James H. Martin, Martha Palmer, Aaron Schram, and Kenneth Mark Anderson. Natural language processing to the rescue? Extracting "situational awareness" tweets during mass emergency. In Proceedings of ICWSM, pages 385–392, 2011.