<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>I See a Car Crash: Real-time Detection of Small Scale Incidents in Microblogs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Axel Schulz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petar Ristoski</string-name>
          <email>petar.ristoski@sap.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heiko Paulheim</string-name>
          <email>heiko@informatik.uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SAP Research</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technische Universitat Darmstadt Telecooperation Lab</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Mannheim Data and Web Science Group</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Microblogs are increasingly gaining attention as an important information source in emergency management. Nevertheless, it is still difficult to reuse this information source during emergency situations because of the sheer amount of unstructured data. Especially for small scale events like car crashes, only small bits of information are available, which complicates the detection of relevant information. We present a solution for the real-time identification of small scale incidents using microblogs, thereby increasing situational awareness by harvesting additional information about incidents. Our approach is a machine learning algorithm combining text classification and semantic enrichment of microblogs. An evaluation shows that our solution enables the identification of small scale incidents with an accuracy of 89% as well as the detection of all incidents published in real-time Linked Open Government Data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Social media platforms are widely used for sharing information about incidents.
Ushahidi, a social platform used for crowd-based filtering of information [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], was
heavily used during the Haitian earthquake for labeling crisis related information,
as well as incidents such as the Oklahoma grass fires and the Red River floods in
April 2009 [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] or the terrorist attacks on Mumbai [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. All these examples show
that citizens already act as observers in crisis situations and provide potentially
valuable information on different social media platforms.
      </p>
      <p>Current approaches for using social media in emergency management largely
focus on large scale incidents like earthquakes. Large scale incidents are
characterized by a large number of social media messages as well as a wide
geographic and/or temporal coverage. In contrast, small scale incidents, such as car
crashes or fires, usually have a small number of social media messages and only
narrow geographic and temporal coverage. This imposes certain challenges on
approaches for detecting small scale incidents: large incidents, like earthquakes,
with thousands of social media postings, are much easier to detect than smaller
incidents with only a dozen postings, and further processing steps, such as
the extraction of factual information, can work on a larger set of texts. For large
scale incidents, one can optimize systems for precision, whereas recall is not
problematic due to the sheer amount of information on the incident. In contrast,
detecting small scale incidents imposes much stricter demands on both precision
and recall.</p>
      <p>
        The approach discussed in this paper combines information from the social
and the semantic web. While the social web consists of text, images, and videos,
all of which cannot be processed by intelligent agents easily, the Linked Open
Data (LOD) cloud contains semantically annotated, formally captured
information. Linked Open Data can be used to enhance Web 2.0 contents by semantically
enriching it, as shown, e.g., in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Such enrichments may also be used as
features for machine learning [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In our approach, we leverage machine
learning and semantic web technologies for detecting user-generated content that is
related to a small scale incident with high accuracy. Furthermore, we refined
current approaches for spatial and temporal filtering, allowing us to detect space
and time information of a small scale incident more precisely. We further show
how Linked Open Government Data can be used to evaluate classifications.
      </p>
      <p>This paper contributes an approach that leverages information provided in
the social web for the detection of small scale incidents. The proposed method
consists of two steps: (1) automatic classification of user-generated content related to
small scale incidents, and (2) prefiltering of irrelevant content, based on spatial,
temporal, and thematic aspects.</p>
      <p>The rest of this paper is structured as follows: In Section 2, we review related
approaches. Section 3 provides a detailed description of our process, followed by
an evaluation in Section 4. In Section 5, we show an example application. We
conclude in Section 6 with a short summary and an outlook on future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Event and incident detection in social media has gained increasing attention in
recent years. Various machine learning approaches have been proposed for
detecting large scale incidents in microblogs, e.g., in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        All those approaches focus on large scale incidents. In contrast, only
a few state-of-the-art approaches focus on the detection of small scale incidents
in microblogs. Agarwal et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] focus on detecting events related to a fire in a
factory. They rely on standard NLP-based features like Named Entity
Recognition (NER) and part-of-speech tagging. Furthermore, they employ a spatial
dictionary to geolocalize tweets on city level. They report a precision of 80%
using a Naïve Bayes classifier.
      </p>
      <p>[Figure: processing pipeline. Tweets pass through three stages. Preprocessing: pre-filtering, removing stopwords, correcting spelling errors, slang replacement, POS filtering, temporal mention replacement, spatial mention replacement. Feature Extraction: word n-grams, char n-grams, TF-IDF, syntactic features, spatial/temporal features, FeGeLOD features. Incident Detection: trained classifier for incident types, temporal and spatial filtering.]</p>
      <p>
        Twitcident [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a mashup for filtering, searching, and analyzing social
media information about small scale incidents. The system uses information about
incidents published in an emergency network in the Netherlands for
constructing an initial query to crawl relevant tweets. The collected messages are further
processed by the semantic enrichment module, which includes NER using
DBpedia Spotlight [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The extracted concepts are used as attribute-value pairs, e.g.,
'(location, dbpedia:Austin Texas)'. Those pairs are used to create
a weighting of important concepts for different types of incidents. Furthermore,
web pages referenced in the tweet message are provided as external
information. Tweets are classified using manually created rules based on the
attribute-value pairs and keywords. Though they show an advantage of using
semantic features, they do not provide any evaluation results for detecting small
scale incidents.
      </p>
      <p>
        Wanichayapong et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] focus on extracting traffic information in
microblogs from Thailand. Compared to other approaches, they use an approach
which detects tweets that contain place mentions as well as traffic related
information. The evaluation of the approach was made on 1249 manually labeled
tweets and showed an accuracy of 91.7%, precision of 91.39%, and recall of
87.53%. Though the results are quite promising, they restricted their initial test
set to tweets containing manually defined traffic related keywords; thus, the
number of relevant tweets is significantly higher than in a random stream.
      </p>
      <p>
        Li et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] introduce a system for searching and visualization of tweets
related to small scale incidents, based on keyword, spatial, and temporal filtering.
Compared to other approaches, they iteratively refine a keyword-based search
for retrieving a higher number of incident related tweets. Based on these tweets,
a classifier is built upon Twitter specific features, such as hashtags, @-mentions,
URLs, and spatial and temporal characteristics. Furthermore, events are
geolocalized on city scale. They report an accuracy of 80% for detecting incident
related tweets, although they do not provide any information about their
evaluation approach.
      </p>
      <p>In summary, only few of the mentioned approaches make use of semantic web
technologies for detecting small scale incidents. Furthermore, none of the mentioned
approaches compares its results with events published in governmental data sources;
thus, the detection of incidents in real-time remains unevaluated.</p>
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
      <p>Our approach consists of two steps. First, from
the preprocessed tweets, several features are extracted and used for classification.
Second, for real-time incident detection, new tweets are first filtered
spatially and temporally. Then the classifier is applied to detect
incident related information. For optimizing our classifier, we evaluate the results
on incident reports published in Linked Open Government Data. As a result,
the optimized pipeline can be used to present valuable information for decision
makers in emergency management systems.</p>
      <sec id="sec-3-1">
        <title>Classification</title>
        <p>Crawling and Preprocessing. We continuously collect tweets using the
Twitter Search API<sup>1</sup>, restricting our search to English tweets from certain
cities. Every collected tweet is then preprocessed. First, we remove all retweets,
as these are just duplicates of other tweets and do not provide additional
information. Second, @-mentions in the tweet message are removed, as we assume
that they are not relevant for the detection of incident related tweets. Third, very
frequent words like stop words are removed, as they are not valuable as
features for a machine learning algorithm. Fourth, abbreviations are resolved using
a dictionary compiled from www.noslang.com. Furthermore, as tweets contain
spelling errors, we apply the Google Spellchecking API<sup>2</sup> to identify and replace
them if possible.</p>
        <p>We use our temporal detection approach (see below) to detect temporal
expressions and replace them with the two annotations @DATE and @TIME, which
prevents overfitting the classification model to temporal values from the training
dataset. Likewise, we use our spatial detection approach (see below) to detect
place and location mentions and replace them with the two annotations @LOC and
@PLC.</p>
        <p>Before extracting features, we normalize the words using the Stanford
lemmatization function<sup>3</sup>. Furthermore, we apply the Stanford POS tagger<sup>4</sup>. This
enables us to filter out some word categories which are not useful for our approach.
E.g., during our evaluation we found that using only nouns and proper nouns
for classification improves the accuracy of the classification (cf. Section 4.3).</p>
        <p>Feature Extraction. After finishing the initial preprocessing steps, we extract
several features from the tweets that will be used for training a classifier:</p>
        <p>Word unigram extraction: A tweet is represented as a set of words. We use
two approaches: a vector with the frequency of words, and a vector with the
occurrence of words (as binary values).</p>
        <p>1 https://dev.twitter.com/docs/api/1.1/get/search/tweets
2 https://code.google.com/p/google-api-spelling-java/
3 http://nlp.stanford.edu/software/corenlp.shtml
4 http://nlp.stanford.edu/software/tagger.shtml</p>
        <p>Character n-grams: A string of three or four consecutive characters in
a tweet message is used as a feature. For example, if a tweet is "Today is
so hot. I feel tired", then the following trigrams are extracted: "tod", "oda",
"day", "ay ", "y i", etc. To construct the trigram or four-gram list,
all special characters, i.e., those that are not a letter, a space character, or a
number, are removed.</p>
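        <p>The character n-gram extraction described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function name char_ngrams is hypothetical, and lowercasing and collapsing repeated spaces are assumptions inferred from the example trigrams.</p>
        <preformat><![CDATA[
```python
import re


def char_ngrams(text, n):
    """Return the list of character n-grams of a tweet after basic cleaning."""
    # Assumption: text is lowercased first (the example maps "Today" to "tod").
    # Everything that is not a letter, a digit, or a space is removed.
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
    # Assumption: runs of spaces left behind by removed characters are collapsed.
    cleaned = re.sub(r" +", " ", cleaned).strip()
    # Slide a window of width n over the cleaned string.
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("Today is so hot. I feel tired", 3)[:5])
```
]]></preformat>
        <p>Calling the same function with n = 4 yields the four-gram list; note that this sketch only keeps ASCII letters and digits.</p>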
        <p>
          TF-IDF: For every document we calculate an accumulated tf-idf score [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. It
measures the overall deviation of a tweet from all the positive tweets in the
training set by summing the tf-idf scores of that tweet, where the idf is
calculated based on all the positive examples in the training set.</p>
        <p>Syntactic features: Along with the features directly extracted from the tweet,
several syntactic features are expected to improve the performance of our
approach. People might tend to use a lot of punctuation, such as exclamation
marks and question marks, or a lot of capitalized letters when they are reporting
an incident. We therefore extract the following features: the number of
"!" and "?" in a tweet and the number of capitalized characters.</p>
        <p>Spatial and Temporal unigram features: As spatial and temporal mentions are
replaced with corresponding annotations, they appear as word unigrams or
character n-grams in our model and can therefore be regarded as additional
features.
        </p>
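        <p>As an illustration of the accumulated tf-idf score and the syntactic features, consider the following sketch. The exact tf and idf weighting is not given in the text, so the smoothed idf formula below is an assumption, and all function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def idf_from_positive(positive_docs):
    """Compute an idf value per term from tokenized positive training tweets."""
    n_docs = len(positive_docs)
    doc_freq = Counter()
    for tokens in positive_docs:
        doc_freq.update(set(tokens))
    # Assumption: smoothed idf; the paper does not state the exact formula.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def accumulated_tfidf(tokens, idf):
    """Sum tf * idf over the terms of one tweet; unseen terms contribute 0."""
    tf = Counter(tokens)
    return sum(count * idf.get(term, 0.0) for term, count in tf.items())


def syntactic_features(raw_text):
    """Count exclamation marks, question marks, and capitalized characters."""
    return {
        "exclamations": raw_text.count("!"),
        "questions": raw_text.count("?"),
        "capitals": sum(1 for ch in raw_text if ch.isupper()),
    }


if __name__ == "__main__":
    idf = idf_from_positive([["car", "crash"], ["car", "accident"]])
    print(accumulated_tfidf(["car", "car", "pizza"], idf))
    print(syntactic_features("HELP! Crash on I-5?!"))
```
]]></preformat>
        <p>Because the idf table is built only from positive examples, a tweet consisting of terms never seen in incident related tweets receives a score of zero in this sketch.</p>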
        <p>
          Linked Open Data features: FeGeLOD [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] is a framework which extracts
features for an entity from Linked Open Data. For example, for the
instance dbpedia:Ford_Mustang, features that can be extracted include the
instance's direct types (such as Automobile), as well as the categories of
the corresponding Wikipedia page (such as Road transport). We use the
transitive closures of types and categories with respect to rdfs:subClassOf
and skos:broader, respectively. The rationale is that those features can be
extracted on a training set and on a test set, even if the actual instance has
never been seen in the training set: the types and categories mentioned above
could also be generated for a tweet talking about a dbpedia:Volkswagen_Passat,
for example. This allows for a semantic abstraction from the
concrete instances a tweet talks about. In order to generate features from tweets
with FeGeLOD, we first preprocess the tweets using DBpedia Spotlight in
order to identify the instances a tweet talks about. Unlike the other feature
extractions, the extraction of Linked Open Data features is performed on
the original tweet, not the preprocessed one.
        </p>
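        <p>The idea of using the transitive closure of types can be illustrated with a small sketch. It does not use FeGeLOD or query DBpedia; the hierarchy and type assignments below are hypothetical toy data, and the function names are made up for this example.</p>
        <preformat><![CDATA[
```python
def transitive_closure(start_nodes, broader):
    """Collect all nodes reachable from start_nodes via the 'broader' relation."""
    seen = set()
    stack = list(start_nodes)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(broader.get(node, ()))
    return seen


# Hypothetical toy hierarchy standing in for rdfs:subClassOf (not real DBpedia data).
SUBCLASS_OF = {
    "Automobile": ["MeanOfTransportation"],
    "MeanOfTransportation": ["Thing"],
}

# Hypothetical direct types of two instances.
DIRECT_TYPES = {
    "dbpedia:Ford_Mustang": ["Automobile"],
    "dbpedia:Volkswagen_Passat": ["Automobile"],
}


def type_features(instance):
    """Return the direct types of an instance plus all their super-types."""
    return transitive_closure(DIRECT_TYPES.get(instance, []), SUBCLASS_OF)


if __name__ == "__main__":
    print(sorted(type_features("dbpedia:Ford_Mustang")))
    print(type_features("dbpedia:Ford_Mustang") == type_features("dbpedia:Volkswagen_Passat"))
```
]]></preformat>
        <p>Both instances yield the same feature set, which is the abstraction effect described above: a classifier trained on tweets about one car model can still use these features for a tweet about another.</p>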
        <p>
          Classification. The different features are combined and evaluated using three
classifiers. For classification, the machine learning library Weka [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] is used. We
compare a Naïve Bayes Binary Model (NBB), the Ripper rule learner (JRip),
and a classifier based on a Support Vector Machine (SVM).
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Real-time incident detection</title>
        <p>For real-time incident detection, we first apply spatial and temporal filtering
before applying the classifier.</p>
        <p>Temporal Extraction. Using the creation date of a tweet is not always
sufficient for detecting events, as people also report on incidents that occurred in the
past or events that will happen in the future. During our evaluations we found
that around 18% of all incident related tweets contain temporal information.
Thus, a mechanism can be applied to filter out tweets that are not temporally
related to a specific event.</p>
        <p>
          For identifying what the temporal relation of a tweet is, we adapted the
HeidelTime [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] framework for temporal extraction. HeidelTime is a rule-based
approach mainly using regular expressions for the extraction of temporal
expressions in texts. As the system was developed for large text documents, we adapted
it to work on microblogs. Based on our adaptation, we are able to extract time
mentions in microblogs like 'yesterday' and use them to calculate the time of
the event to which a tweet refers. E.g., the tweet 'I still remember that car
accident from Tuesday', created on Friday, 15.02.2013, 14:33, can now be readjusted
to reference an accident on Tuesday, 12.02.2013, 14:33.
        </p>
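        <p>The readjustment in the example above can be sketched with plain date arithmetic. This is not HeidelTime and not the adapted rule set; the function resolve_time_mention is hypothetical, it handles only a few mentions, and it assumes that a bare weekday name refers to its most recent past occurrence.</p>
        <preformat><![CDATA[
```python
from datetime import datetime, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def resolve_time_mention(mention, created_at):
    """Map a relative time mention to a datetime, or None if unsupported."""
    word = mention.lower()
    if word == "today":
        return created_at
    if word == "yesterday":
        return created_at - timedelta(days=1)
    if word in WEEKDAYS:
        # Assumption: a bare weekday refers to its most recent past occurrence,
        # so the same weekday as the creation date maps to one week earlier.
        days_back = (created_at.weekday() - WEEKDAYS.index(word)) % 7
        return created_at - timedelta(days=days_back or 7)
    return None


if __name__ == "__main__":
    created = datetime(2013, 2, 15, 14, 33)  # a Friday
    print(resolve_time_mention("Tuesday", created))
```
]]></preformat>
        <p>As in the example from the text, the time of day of the tweet is carried over unchanged, since the mention itself only specifies the day.</p>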
        <p>As discussed above, we also use our adaptation to replace time mentions in
microblogs with the annotations @DATE and @TIME to use temporal mentions
as additional features. Furthermore, compared to other approaches we are able to
retrieve precise temporal information about a small scale incident, which enables
the real-time detection of current incidents.</p>
        <p>Spatial Extraction. Besides temporal filtering, spatial filtering is also
applied. As only 1% of all tweets retrieved from the Twitter Search API are
geotagged, location mentions in tweet messages or the user's profile information
have to be identified.</p>
        <p>For location extraction, we use a threefold approach. First, location mentions
are identified using Stanford NER<sup>5</sup>. As we are only interested in recognizing
location mentions, which include streets, highways, landmarks, blocks, or zones,
we retrained the Stanford NER model on a set of 1250
manually labeled tweets. We use only two classes for named entities: LOCATION
and PLACE. All location mentions for which we can extract accurate geo
coordinates, such as cities, streets, and landmarks, are labeled with LOCATION.
Words that abstractly describe where the event
took place, such as "home", "office", "school", etc., are labeled with PLACE.
The customized NER model has a precision of 95.5% and a recall of 91.29%. As
discussed above, we also use our adaptation to annotate the text so that a spatial
mention can be used as an additional feature. The following tweet may be the
result of this approach: "Several people are injured in car crash on &lt;LOC&gt;5th
Ave&lt;/LOC&gt; &lt;LOC&gt;Seattle&lt;/LOC&gt;"</p>
        <p>
          Second, to relate the location mention to a point where the event happened,
we geocode the location strings. To this end, we create a set of word unigrams,
bigrams, and trigrams. These are sent to the geographical database GeoNames<sup>6</sup>
to identify city names in each of the n-grams and to extract geocoordinates. As
city names are ambiguous around the world, we choose the longest n-gram as
the most probable city. If there is no city mention in the tweet, we try to extract
the city from the location field in the user profile (see [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] for details).
        </p>
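        <p>The n-gram based city lookup can be sketched as follows. The dictionary GAZETTEER is a hypothetical stand-in for a GeoNames query with approximate coordinates, and the function names are illustrative only; the sketch merely shows the rule of preferring the longest matching n-gram.</p>
        <preformat><![CDATA[
```python
def word_ngrams(tokens, max_n=3):
    """Return all word n-grams (as strings) for n = 1..max_n."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


# Hypothetical stand-in for a GeoNames lookup: city name -> approximate (lat, lon).
GAZETTEER = {
    "york": (53.96, -1.08),
    "new york": (40.71, -74.01),
    "seattle": (47.61, -122.33),
}


def most_probable_city(location_text):
    """Return (name, coordinates) of the longest n-gram that names a city."""
    tokens = location_text.lower().split()
    matches = [gram for gram in word_ngrams(tokens) if gram in GAZETTEER]
    if not matches:
        return None
    # Prefer the n-gram with the most words, e.g. "new york" over "york".
    best = max(matches, key=lambda gram: len(gram.split()))
    return best, GAZETTEER[best]


if __name__ == "__main__":
    print(most_probable_city("crash in New York downtown"))
```
]]></preformat>
        <p>When no n-gram matches, the function returns None, which corresponds to the case where the location field of the user profile would be consulted instead.</p>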
        <sec id="sec-3-2-1">
          <p>5 http://nlp.stanford.edu/software/CRF-NER.shtml
6 http://www.geonames.org</p>
          <p>Third, for fine-grained geolocalization on street or building level, we
use the MapQuest Nominatim API<sup>7</sup>, which is based on OpenStreetMap data.
Using this approach, we are able to extract precise location information for 87%
of the tweets. Compared to other approaches, which rely on city level precision,
we are able to precisely geolocalize small scale events on street level.</p>
          <p>Classification. After temporal and spatial filtering, we apply our classifier to
identify incident related tweets. For further refinement and to verify that
our approach enables the real-time detection of tweets, we compare our results
with the official incident information published in Linked Open Government
Data, such as the data that can be retrieved from data.seattle.gov. Finally, the
detected relevant information can be presented to decision makers in incident
management systems (cf. Section 5).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>We conduct an evaluation of our method on the publicly available Twitter feed.
First, we evaluate the performance of the FeGeLOD features. Second, we measure
the performance using all features. Third, we compare our results to real-time
incident reports.</p>
      <sec id="sec-4-1">
        <title>Datasets</title>
        <p>Training Dataset. For building a training dataset, we collected 6 million public
tweets using the Twitter Search API from November 19th, 2012 to December
19th, 2012 in a 15km radius around the city centers of Seattle, WA and Memphis,
TN. For labeling the tweets, we first extracted tweets containing incident related
keywords. We retrieved all incident types using the "Seattle Real Time Fire 911
Calls" dataset from data.seattle.gov and defined one general keyword set with
keywords that are used in all types of incidents, like "incident", "injury",
"police", etc. For each incident type, we further identified specific keywords, e.g., for
the incident type "Motor Vehicle Accident Freeway" we use the keywords
"vehicle", "accident", and "road". Based on these words, we use WordNet<sup>8</sup> to extend
this set by adding the direct hyponyms. For instance, the keyword "accident"
was extended with "collision", "crash", "wreck", "injury", "fatal accident", and
"casualty".</p>
        <p>For building our training set for identifying car crashes, we used the general
and the specific keyword set for the incident types "Motor Vehicle Accident",
"Motor Vehicle Accident Freeway", "Car Fire", and "Car Fire Freeway" to
extract tweets that might be related to car crashes. We randomly selected 10k
tweets from this set and manually labeled them with the two classes "car accident
related" and "not car accident related". The tweets were labeled by scientific
members of our departments. The final training set consists of 993 car accident
related and 993 not car accident related tweets.</p>
        <sec id="sec-4-1-1">
          <p>7 http://developer.mapquest.com/web/products/open/nominatim
8 http://wordnet.princeton.edu</p>
          <p>Test Dataset. To show that the model resulting from the training dataset is not
overfitted to the events in the period when the training data was collected, we
collected an additional 1.5 million tweets in the period from February 1st, 2013
to February 7th, 2013 from the same cities. We again used the keyword extraction
approach and manually labeled the test dataset, resulting in 320 car accident related
tweets and 320 not car accident related tweets.</p>
          <p>Socrata Dataset. For evaluating our approach with real-time Linked Open
Government Data, we use the "Seattle Real Time Fire 911 Calls" dataset from
data.seattle.gov. In the period from February 5th, 2013 to February 7th, 2013, we
collected information about 15 incidents related to car crashes and 830 related
to other incident types.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Metrics</title>
        <p>Our classification results are calculated using stratified 10-fold cross validation
on the training set, and by evaluating a model trained on the training dataset on
the test dataset. To measure the performance of the classification approaches,
we report the following metrics:</p>
        <list list-type="bullet">
          <list-item><p>Accuracy (Acc): Number of correctly classified tweets divided by the total
number of tweets.</p></list-item>
          <list-item><p>Averaged Precision (Prec): Calculated based on the precision of each class
(how many of our predictions for a class are correct).</p></list-item>
          <list-item><p>Averaged Recall (Rec): Calculated based on the recall of each class (how
many tweets of a class are correctly classified as this class).</p></list-item>
          <list-item><p>F-Measure (F): Weighted harmonic mean of precision and recall.</p></list-item>
        </list>
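        <p>For illustration, the metrics above can be computed as in the following sketch. It assumes an unweighted average over classes and the balanced F-measure (F1) derived from the averaged precision and recall; the averaging actually applied by the evaluation tool may differ, and the function name evaluate is hypothetical.</p>
        <preformat><![CDATA[
```python
def evaluate(y_true, y_pred):
    """Return accuracy and class-averaged precision, recall, and F-measure."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = sum(1 for t in y_true if t == label)
        # Per-class precision and recall; 0.0 if the denominator is empty.
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    # Assumption: unweighted (macro) average across classes.
    precision = sum(precisions) / len(labels)
    recall = sum(recalls) / len(labels)
    # Assumption: balanced F-measure (harmonic mean of precision and recall).
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"acc": accuracy, "prec": precision, "rec": recall, "f": f1}


if __name__ == "__main__":
    print(evaluate([1, 1, 0, 0], [1, 0, 0, 0]))
```
]]></preformat>
        <p>With equally sized classes, as in the training and test datasets described above, the unweighted and the class-size-weighted average coincide.</p>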
      </sec>
      <sec id="sec-4-3">
        <title>Results</title>
        <p>Evaluation of FeGeLOD Features. We used DBpedia Spotlight to detect
named entities in the tweet messages<sup>9</sup>. These annotations were used to
generate additional features with FeGeLOD, as discussed above. Table 1 shows the
classification accuracy achieved using only features generated with FeGeLOD
from the training dataset. We used FeGeLOD to create three models with
different features. The first one contains only types of the extracted entities, the
second one contains only Wikipedia categories of the extracted entities, and the
third one contains both categories and types of the extracted entities from the
tweets. The best results, with an accuracy of 67.1%, are achieved when using
categories and types. Furthermore, analyzing the results from JRip, we get rules
using categories such as "Accidents", "Injuries", or "Road infrastructure" and types
such as "Road104096066" or "AdministrativeArea". This shows that the features
generated by FeGeLOD are actually meaningful.</p>
        <p>9 The parameters used were: Confidence = 0.2; Contextual score = 0.9; Support = 20;
Disambiguator = Document; Spotter = LingPipeSpotter; Only best candidate</p>
        <p>Evaluation of all features. In order to evaluate which machine learning
features contribute the most to the accuracy of the classifier, we built different
classification models for all the combinations of the features that we described
in Section 3.1. We first evaluated the models on the training dataset, then we
reevaluated the models on the test dataset. Table 2 shows the best classification
results achieved using different combinations of machine learning features.</p>
        <p>The tests showed that using word n-grams without POS filtering, the accumulated
TF-IDF score, and syntactic features provides the best classification results
for the training dataset. However, when re-evaluating the same model on the test set, which
contains data from a different time period, the results dropped significantly.
This leads to the conclusion that the model is overfitted to the training dataset.
Furthermore, the model contains a large number of features and requires a lot
of processing power to be used for predictions in real-time.</p>
        <p>Adding the features generated by FeGeLOD did not affect the classification
accuracy on the training dataset, but improved the classification accuracy on
the test dataset. This shows that using semantic features generated from LOD
helps to prevent overfitting. Additionally, we used POS filtering to filter out
some word categories from the tweet to see which word categories contribute
to the classification performance. The tests showed that using only nouns and
proper nouns when generating the word n-grams improves the results on the test
set. This approach significantly reduced the number of attributes in the model,
making it applicable for real-time predictions.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Evaluation on real-world incident reports</title>
        <p>To evaluate how many incidents
our system can detect compared to governmental emergency systems, we
evaluated our predictions with the data from the Socrata test set. We matched each
of the 15 car accidents with an incident from our system if a spatial (150m)
and temporal (+/-20min) matching applied. Using this approach, we were able
to detect all of the Socrata incidents. As the average number
of tweets identified by our system for each car accident was around ten (with a
minimum of only three tweets for one accident), this shows that our approach is
capable of detecting incidents with only very few social media posts.</p>
        <p>Performance. For our experiments, we have crawled the Twitter API as
discussed above. In the scenario using data from Seattle and Memphis, there were
around 100 tweets per minute. Processing and classifying a batch of 500 tweets
using our trained SVM takes around seven seconds, which is about 14
milliseconds per tweet.</p>
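        <p>The spatial (150m) and temporal (+/-20min) matching of official reports and detected incidents can be sketched as follows. The text does not state how distances were computed, so the haversine great-circle distance is an assumption, and the dictionary keys and function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from datetime import datetime, timedelta


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def matches(report, detection, max_dist_m=150.0, max_delta=timedelta(minutes=20)):
    """True if a detection lies within both the spatial and the temporal window."""
    close = haversine_m(report["lat"], report["lon"],
                        detection["lat"], detection["lon"]) <= max_dist_m
    timely = abs(report["time"] - detection["time"]) <= max_delta
    return close and timely


if __name__ == "__main__":
    # Hypothetical example records, not taken from the evaluation data.
    report = {"lat": 47.6097, "lon": -122.3331, "time": datetime(2013, 2, 5, 10, 0)}
    detection = {"lat": 47.6100, "lon": -122.3331, "time": datetime(2013, 2, 5, 10, 15)}
    print(matches(report, detection))
```
]]></preformat>
        <p>An official report counts as detected in this sketch if at least one detected incident satisfies both conditions.</p>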
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Example Application</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], we introduced the idea of an information cockpit called Incident
Classifier as a central access point for a decision maker to use different types of
user-generated content for increasing the understanding of the situation at hand.
Based on an aggregation algorithm we introduced in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], we integrate the
classified microblogs in the Incident Classifier. We apply spatio-temporal
filtering and aggregation based on the incident type, e.g., the "car crash" type.
Figure 2 shows the aggregation of different incident related information for a small
scale incident, including images extracted from web pages referenced in tweets.
In this example, a picture of the incident as well as the number
of involved cars are shown, enabling a decision maker to get additional
information about an incident that might be relevant.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>This paper contributes an approach that leverages information provided in
microblogs for detection of small scale incidents. We showed how machine
learning and semantic web technologies can be combined to identify incident related
microblogs. With 89% detection accuracy, we outperform state-of-the-art
approaches. Furthermore, our approach is able to precisely localize microblogs in
space and time, thus, enabling the real-time detection of incidents.</p>
      <p>With the presented approach, we are able to detect valuable information
during crisis situations in the huge amount of information published in microblogs.
Additional and previously unknown information can thus be retrieved
that could contribute to enhancing situational awareness for decision making in
daily crisis management. In the future, we aim at refining our approach, e.g.,
by using more sophisticated NLP techniques, exploring the capability of our
approach to detect other types of events, and including larger sets of open
government data in the evaluation.</p>
      <sec id="sec-6-1">
        <title>Acknowledgements</title>
        <p>This work has been partly funded by the German Federal Ministry for
Education and Research (BMBF, 13N10712, 01jS12024), by the German Science
Foundation (DFG, FU 580/2), and by EIT ICT Labs under the activities 12113
Emergent Social Mobility and 13064 CityCrowdSource of the Business Plan 2012.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hauff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Houben</surname>
            ,
            <given-names>G.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stronkman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Twitcident: Fighting Fire with Information from Social Web Stream</article-title>
          . In: International Conference on Hypertext and Social Media, Milwaukee, USA, ACM (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vaithiyanathan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shroff</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Catching the long-tail: Extracting local news events from twitter</article-title>
          .
          <source>In: Proceedings of the Sixth International Conference on Weblogs and Social Media ICWSM</source>
          <year>2012</year>
          , Dublin, Ireland. (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Goolsby</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Lifting Elephants: Twitter and Blogging in Global Perspective</article-title>
          .
          <source>In: Social Computing and Behavioral Modeling</source>
          . Springer, Berlin, Heidelberg (
          <year>2009</year>
          )
          <fpage>1</fpage>
          –
          <lpage>6</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Heim</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thom</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>SemSor: Combining Social and Semantic Web to Support the Analysis of Emergency Situations</article-title>
          .
          <source>In: Proceedings of the 2nd Workshop on Semantic Models for Adaptive Interactive Systems SEMAIS</source>
          , Springer, Berlin, Heidelberg (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hienert</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wegener</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Automatic classification and relationship extraction for multi-lingual and multi-granular events from wikipedia</article-title>
          .
          <source>In: Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2012)</source>
          . Volume 902 of CEUR-WS. (
          <year>2012</year>
          )
          <fpage>1</fpage>
          –
          <lpage>10</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jadhav</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mutharaju</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anantharam</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Twitris: Socially Influenced Browsing</article-title>
          .
          <source>In: Semantic Web Challenge</source>
          <year>2009</year>
          , demo at 8th International Semantic Web Conference, Washington, DC, USA (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Krstajic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrdantz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hund</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiler</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Getting There First: Real-Time Detection of Real-World Incidents on Twitter</article-title>
          .
          <source>In: Published at the 2nd IEEE Workshop on Interactive Visual Text Analytics</source>
          , Seattle, WA, USA. (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lei</surname>
            ,
            <given-names>K.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khadiwala</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K.C.C.</given-names>
          </string-name>
          :
          <article-title>Tedas: A twitter-based event detection and analysis system</article-title>
          .
          <source>In: 2011 11th International Conference on ITS Telecommunications (ITST)</source>
          ,
          <source>IEEE Computer Society</source>
          (
          <year>2012</year>
          )
          <fpage>1273</fpage>
          –
          <lpage>1276</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Schutze., H. In: An Introduction to Information Retrieval. Cambridge University Press (
          <year>2009</year>
          )
          <volume>117</volume>
          {
          <fpage>120</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Marcus</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Badar</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karger</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madden</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>R.C.</given-names>
          </string-name>
          :
          <article-title>Twitinfo: aggregating and visualizing microblogs for event exploration</article-title>
          .
          <source>In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI '11</source>
          , New York, NY, USA, ACM (
          <year>2011</year>
          )
          <fpage>227</fpage>
          –
          <lpage>236</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-Silva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>DBpedia Spotlight: Shedding Light on the Web of Documents</article-title>
          .
          <source>In: Proceedings of the 7th International Conference on Semantic Systems (I-Semantics)</source>
          , Graz, Austria, ACM (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Okolloh</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Ushahidi, or 'testimony': Web 2.0 tools for crowdsourcing crisis information</article-title>
          .
          <source>Participatory Learning and Action</source>
          <volume>59</volume>
          (January) (
          <year>2008</year>
          )
          <fpage>65</fpage>
          –
          <lpage>70</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , Furnkranz, J.:
          <article-title>Unsupervised Feature Generation from Linked Open Data</article-title>
          . In: International Conference on Web Intelligence, Mining, and
          <string-name>
            <surname>Semantics</surname>
          </string-name>
          (WIMS'12). (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Sakaki</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Okazaki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Earthquake shakes Twitter users: real-time event detection by social sensors</article-title>
          .
          <source>In: WWW '10 Proceedings of the 19th international conference on World Wide Web</source>
          . (
          <year>2010</year>
          )
          <fpage>851</fpage>
          –
          <lpage>860</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Schulz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hadjakos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nachtwey</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , , Muhlhauser,
          <string-name>
            <surname>M.:</surname>
          </string-name>
          <article-title>A multiindicator approach for geolocalization of tweets</article-title>
          .
          <source>In: Proceedings of the Seventh International Conference on Weblogs and Social Media (ICWSM)</source>
          .
          <article-title>(</article-title>
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Schulz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ortmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Probst</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Getting user-generated content structured: Overcoming information overload in emergency management</article-title>
          .
          <source>In: Proceedings of 2012 IEEE Global Humanitarian Technology Conference (GHTC 2012)</source>
          . (
          <year>2012</year>
          )
          <fpage>1</fpage>
          –
          <lpage>10</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Schulz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Probst</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Crisis Information Management in the Web 3.0 Age</article-title>
          .
          <source>In: Proceedings of the Information Systems for Crisis Response and Management Conference (ISCRAM 2012)</source>
          . (
          <year>2012</year>
          )
          <fpage>1</fpage>
          –
          <lpage>6</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. Strotgen, J.,
          <string-name>
            <surname>Gertz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Multilingual and cross-domain temporal tagging</article-title>
          .
          <source>Language Resources and Evaluation</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Vieweg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hughes</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Starbird</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Microblogging during two natural hazards events: what twitter may contribute to situational awareness</article-title>
          .
          <source>In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI '10</source>
          , New York, NY, USA, ACM (
          <year>2010</year>
          )
          <fpage>1079</fpage>
          –
          <lpage>1088</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Wanichayapong</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pruthipunyaskul</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pattara-Atikom</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaovalit</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Social-based traffic information extraction and classification</article-title>
          .
          <source>In: 11th International Conference on ITS Telecommunications (ITST)</source>
          . (
          <year>2011</year>
          )
          <fpage>107</fpage>
          –
          <lpage>112</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Data mining: practical machine learning tools and techniques</article-title>
          . Elsevier, Morgan Kaufmann, Amsterdam, Netherlands (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>