=Paper= {{Paper |id=Vol-1392/paper-06 |storemode=property |title=Event-based Clustering for Reducing Labeling Costs of Incident-Related Microposts |pdfUrl=https://ceur-ws.org/Vol-1392/paper-06.pdf |volume=Vol-1392 |dblpUrl=https://dblp.org/rec/conf/icml/SchulzRFJ15 }} ==Event-based Clustering for Reducing Labeling Costs of Incident-Related Microposts== https://ceur-ws.org/Vol-1392/paper-06.pdf

Event-Based Clustering for Reducing Labeling Costs of Incident-Related
Microposts

Axel Schulz SCHULZ . AXEL @ GMX . NET
DB Mobility Logistics AG, Germany and Telecooperation Lab, Technische Universität Darmstadt, Germany
Petar Ristoski PETAR . RISTOSKI @ INFORMATIK . UNI - MANNHEIM . DE
Data and Web Science Group, University of Mannheim, Germany
Johannes Fürnkranz JUFFI @ KE . INFORMATIK . TU - DARMSTADT. DE
Knowledge Engineering Group, Technische Universität Darmstadt, Germany
Frederik Janssen JANSSEN @ KE . TU - DARMSTADT. DE
Knowledge Engineering Group, Technische Universität Darmstadt, Germany

Abstract 1. Introduction
Automatically identifying the event type of Detecting event-related information in microposts has
event-related information in the sheer amount of shown its value for a variety of domains. Especially in
social media data makes machine learning in- emergency management, different situational information
evitable. However, this is highly dependent on is present that could contribute to understand the situation
(1) the number of correctly labeled instances and at hand (Schulz, 2014). However, solving the actual prob-
(2) labeling costs. Active learning has been pro- lem of classifying the incident type in this domain requires
posed to reduce the number of instances to la- labeled data. One of the main problems with microposts
bel. Though, current approaches focus on the is acquiring ground truth for utilizing supervised learning.
thematic dimension, i.e., the event type, for se- Thus, we deal with two major issues: (1) The costs for la-
lecting instances to label; other metadata such beling a single instance, and (2) the number of instances to
as spatial and temporal information that is help- label.
ful for achieving a more fine-grained clustering
On the one hand, to actually build a classifier that is able
is currently not taken into account. Also, label-
to accurately predict the type of the incident mentioned in
ing quality is always assumed to be perfect as
a tweet, usually experts are deployed for labeling as they
currently no qualitative information is present for
have enough domain knowledge to create ground truth.
manual event type labeling.
However, as often several hundreds of examples have to be
In this paper, we present a novel event-based labeled until the classifier is able to reach sufficient qual-
clustering strategy that makes use of temporal, ity, relying on experts for labeling is not always possible
spatial, and thematic metadata to determine in- and it is costly. In contrast, labels can also be derived from
stances to label. Furthermore, we also inspect the non-experts, i.e., by making use of crowdsourcing. Given
quality of the manual labeling in a crowdsourcing that the labels obtained in this way are of sufficient quality,
study by comparing experts and non-experts. An the costs for such a process would be acceptable as crowd-
evaluation on incident-related tweets shows that sourcing is rather cheap. But up to now there is no infor-
(i) labels provided by crowdsourcing are of ac- mation about labeling quality for incident-related tweets.
ceptable quality and (ii) our selection strategy for Hence, first we proceeded by comparing the labeling qual-
active learning outperforms current state-of-the- ity of experts and non-experts.
art approaches even with few labeled instances.
On the other hand, the number of instances to label has to
be kept as low as possible. Due to the huge number of
Proceedings of the 2 nd International Workshop on Mining Urban tweets, labeling all instances is not possible as even with
Data, Lille, France, 2015. Copyright c 2015 for this paper by its cheap labeling the costs would explode. Keeping the num-
authors. Copying permitted for private and academic purposes. ber of instances to label low while maintaining accurate

99
Event-based Clustering for Reducing Labeling Costs of Incident-Related Microposts

classifiers is a typical active learning (Settles, 2012) prob- We begin with summarizing related approaches. Next, we
lem. Here, labeling costs are reduced by iteratively (1) se- show how the ground truth data was developed. Then we
lecting small subsets of instances to query for labels and (2) summarize the results of the study on crowdsourced labels
re-training a classifier with the newly labeled data. Thus, (Section 4) followed by a description of the event-based
in general, but also specifically for classifying microposts, clustering for active learning. After, the results are shown
there are two issues to solve, namely selecting a good initial and discussed (Section 6) and the paper is concluded.
training set and the right instances in each iteration.
For selecting appropriate instances, several selection strate- 2. Related Work
gies have been proposed based on the two criteria, informa-
Although active learning has been studied extensively for
tiveness and representativeness (Huang et al., 2010). Infor-
text classification (Hoi et al., 2006; Tong & Koller, 2002),
mativeness measures the usefulness of an instance to re-
it was used for tweets only by a few previous works.
duce the uncertainty of the model, whereas representative-
(Thongsuk et al., 2010) presented a technique for classi-
ness measures how good an instance represents the overall
fying tweets into three business types. They showed that
input of unlabeled data. The latter usually is solved by em-
using active learning outperforms simple supervised learn-
ploying clustering approaches where then from each cluster
ing approaches in terms of labeling costs.
the representative instances are drawn. Indeed, for event-
type classification the number of clusters to build is not (Hu et al., 2013) presented the ActNeT approach, which
known in advance, as it is unknown how often an event oc- takes the relations between tweets into account for identify-
curred. Hence, most often it is set to the number of distinct ing representative as well as informative instances. Based
event types, which obviously is not appropriate. For in- on a social network, the topology is used to detect repre-
stance, one event might be a tiny fire in a waste bin whereas sentative instances using the PageRank algorithm. Infor-
another is a huge fire in a factory; though microposts for mative instances are chosen using an entropy-based uncer-
both events need to be classified with the “fire” event type, tainty sampling. However, as building the social network
state-of-the-art approaches would not distinguish these two is time consuming and not always possible due to API re-
events and thus could not yield an optimal selection of in- strictions, their approach is not applicable for our problem.
stances to label. For better distinguishing events, a straight- Also, they do not use event-related metadata.
forward approach is to characterize an event not only by its
Several selection strategies were presented that propose
type, but also by spatial and temporal information. Pro-
to select informative as well as representative instances.
ceeding this way, the two example events are inherently
(Tang et al., 2002) used k-means clustering and proposed to
assigned to different clusters and hence instances to be la-
select the most uncertain instance for each cluster. Informa-
beled are drawn from both of them.
tion density was then used to weight instances. (Shen et al.,
Consequently, we contribute an event-based clustering ap- 2004) applied k-means clustering and uncertainty sampling
proach that also leverages the temporal and spatial dimen- and used the information density calculated within a clus-
sion of tweets to allow a more fine-grained clustering. Due ter. (Donmez et al., 2007) combined uncertainty sampling
to smaller clusters the selection of appropriate instances is and k-Medoid to identify representative as well as informa-
easier because one can assume that even with a bad sam- tive instances and showed that this combination is indeed
pling the selected instances will still be of high quality. beneficial.
The evaluation on incident-related tweets shows that this
The approach of (Zhu et al., 2008) is the most advanced re-
enhanced clustering indeed improves the selection com-
lated approach when it comes to combining representative-
pared to state-of-the-art approaches. It is also shown that
ness and informativeness, thus, we used it as a foundation
our approach has a good performance even when only few
for our technique. The authors employed clustering for the
examples are labeled.
initial selection. Uncertainty sampling is combined with
In summary, the contributions of this paper are: (1) A study estimating a density for each iteration. Unlike their work,
comparing the labeling quality of experts and non-experts we apply our event-based clustering also for the iterations.
showing no significant difference of error rates. (2) A novel (Huang et al., 2010) followed a similar approach. Instances
event-based clustering approach that makes use of spatial, are selected based on clustering and on confidence in pre-
temporal, and thematic information present in microposts. dicting a class label as informativeness measure. Though
The clustering benefits strongly from these additional di- their approach is quite promising, the authors stated that it
mensions. (3) A comparison of our approach using differ- is restricted to binary classification, whereas we are able to
ent number of annotators and different levels of noise. Even classify multiple classes.
with a classifier that was not explicitly build to be robust,
Taking labeling quality into account is still open to re-
noise does not hinder the classifier much.
search. Up to now, there is no study of labeling quality

98
Event-based Clustering for Reducing Labeling Costs of Incident-Related Microposts

of event-related tweets, but only studies on structured texts
Table 1. Results for the random error evaluated in a study on qual-
such as the work of (Hsueh et al., 2009). Since 2008, the
ity of crowdsourced labels. Means (µ) and standard deviation
active learning community also tackled the problem of dif- (SD) of the error rates are displayed for each user group.
ferent reliabilities of oracles (Donmez & Carbonell, 2008;
Random Error
Zhao et al., 2011; Wallace et al., 2011). These approaches
Crowd Expert
have been proposed to take labeling uncertainty into ac- µ 0.0338 0.0323
count and show that repeated re-labeling of wrongly la- SD 0.0006 0.0002
beled tweets could improve label quality and model quality.
Nevertheless, most often synthetic error rates have been as-
sumed. incident related tweets.2 For our evaluation, we used 1,200
tweets for training and 800 tweets for testing (temporal
To sum up, some works tried to combine informative-
split, i.e., the testing instances are later in time than the
ness and representative for selecting instances and showed
training instances). Though this selection might seem arbi-
promising results. Nevertheless, none of these approaches
trary, all compared algorithms rely on the same sampling,
has been evaluated on microposts or has taken event-related
thus, allowing for a fair comparison.
metadata into account. Also, no information about real-
world error rates is present or was used in active learning.
4. Study on Quality of Crowdsourced Labels
3. Developing Ground Truth Data In active learning, most often a perfect oracle is assumed
for labeling instances. As this might not hold true in a real-
In this section, we present our dataset used for our evalua-
world environment, we conducted a study on labeling ac-
tion. We focus on incident-related tweets as a specific type
curacy. When it comes to labeling accuracy, the general
of event-related data. We differentiate between three inci-
assumption is that labeling quality in crowdsourcing envi-
dent types in order to classify microposts. These have been
ronments might be dependent on the domain knowledge of
chosen because we identified them as the most common in-
the annotators (Zhao et al., 2011). Thus, one of the goals of
cident types in the Seattle Fire Calls dataset1 , which is a
the study is to analyze if the labeling quality of non-experts
frequently updated source for official incident information.
differs significantly from domain experts. To answer this
We also add one neutral class, thus, our final classes are:
question, we evaluated two user groups in our study: do-
car crash, fire, shooting, and no incident.
main experts and regular crowd users with no or limited
As there are no publicly available labeled datasets for domain knowledge. Second, there is no work describing er-
event-related microposts, we needed to create our own ror rates for labeling of incident-related microposts. Thus,
high-quality ground truth data. For this, we collected En- we want to quantify the error rates, so we can use them for
glish microposts using the Twitter Search API. For the col- our simulations. For this, we evaluated the random error,
lection, we used a 15km radius around the city centers of i.e., the error that results from the annotator carelessness.
Seattle, WA and Memphis, TN. We focused on only two E.g., a wrong label is occasionally assigned. The random
cities, as for our analysis we were interested in a large error is regarded as i.i.d. noise on each label, thus, we as-
stream of tweets for a specific time period of certain ar- sume a fixed probability RE ∈ [0, 1].
eas instead of a world-wide scattered sample. This gave us
We assume a different labeling quality for crowd users
a set of 7.5M microposts from Nov. 19th, 2012 until Feb.
(CU) and domain experts (EX) and test the following
7th, 2013. Although the datasets have been collected in dif-
hypothesis H: The means (µ) of the random error are
ferent time periods, we do not expect any difference in the
different across both user groups (H0 : µRE,CU =
way people post about incidents.
µRE,EX , HA : µRE,CU 6= µRE,EX ).
As this initial set was used for conducting our experiments,
We created a survey to conduct the labeling of our complete
we had to further reduce the size of the datasets follow-
ground truth dataset according to the incident types. Four-
ing our approach as described in (Schulz et al., 2013b).
teen users participated in the study. Eight participants were
The resulting 2,000 tweets were manually labeled by four
crowd users with no or low experience in the crisis man-
domain-experts using an online survey. To assign the final
agement domain and six users were domain experts with
coding, at least three coders had to agree on a label. In-
more than three years experience in the domain. At least
stances without an agreement were further examined and
three crowd users and at least two domain experts labeled
relabeled during a group discussion. The final dataset con-
each tweet. Based on the results, we calculate the random
sists of 328 fire, 309 crash, 334 shooting, and 1029 not
error (cf. Table 1) compared to the ground truth labels.
1
http://data.seattle.gov 2
All datasets will be published at http://www.doc.
gold.ac.uk/˜cguck001/IncidentTweets/

9e
Event-based Clustering for Reducing Labeling Costs of Incident-Related Microposts

For evaluating our hypothesis, we first confirmed normal Here, microposts are processed with standard Natural Lan-
distribution for all error types and both user groups using guage Processing (NLP) techniques such as stopword re-
the Anderson-Darling as well as the Shapiro-Wilk Normal- moval, POS-tagging, and lemmatization. Afterwards, sev-
ity test. Furthermore, we conducted a two-sample F-test eral features are extracted from the preprocessed instances
for variances to verify same variances for all combinations such as word-3-grams after POS-filtering, TF-IDF scores,
with p < 0.01. For each combination we conducted the syntactic features as well as semantic features. The syn-
two-sample t-test assuming equal variances. For all com- tactic features are the number of exclamation and ques-
binations the null hypotheses could not be rejected with tion marks as well as the number of upper case charac-
p < 0.01. Thus, for all error types, we cannot assume a ters. The semantic features are a feature group derived us-
difference between both user groups. This means that in ing different means of Semantic Abstraction (Schulz et al.,
our study there is no conceivable difference between do- 2015). Furthermore, the existing approach allows us to ex-
main experts and common crowd users. tract a likely date of an event mentioned in a micropost. To
identify the temporal information in a tweet, we adapted
One reason might be the rather low sample size. Others
the HeidelTime framework for temporal extraction as pre-
might be found in the nature of microposts as they are
sented in (Schulz et al., 2013b).
short and the amount of available information per tweet is
limited. Thus, the complexity of the information is low As the number of geotagged microposts is rather low (about
and it is possible to understand the content even as a non- 1-2%), we reuse an extension of our approach for geolo-
expert. Furthermore, as tweets are send by lots of different calization (Schulz et al., 2013a) of microposts as well as
individuals, the number of domain specific terms could be for extracting location mentions as features used in the
rather low compared to specialized texts. Also, as incident- classification. For geolocalization an estimation of the
related tweets are common topics compared to physics or city and the country where a tweet was send from was
medicine, people are somehow used to the vocabulary. used and additionally location mentions extracted from the
tweet message were considered. First, we use a Stanford
To reflect a real-world situation best, we combined the re-
NER3 model to identify all location mentions. Then, the
sults of both groups as in typical crowdsourcing studies
discovered locations are geocoded using the geographical
both groups might be present. Also, the labels of the ex-
database GeoNames4 , and the MapQuest Nominatim API5
perts are available anyway for our dataset. This gave us a
for more fine-grained locations, like streets. The intersec-
final error rate of 0.0331 for the random error.
tion of all locations extracted from the tweet is used as an
estimation of the location where an event mentioned in a
5. Event-Based Clustering tweet has happened.
In this section, we show how active learning can be utilized After the initial training, the classifier is retrained in several
to classify the incident type of microposts. We also intro- iterations using newly labeled instances. After each itera-
duce our approach and present how we cope with the initial tion, the labeled instances are removed from the pool of un-
selection problem, i.e., how to select the initial training set, labeled instances U and added to L, thus, more instances
as well as with the query selection problem, i.e., how to can be used for learning. A selection strategy is used on
choose appropriate instances for labeling in each iteration. U to query labels for a number of instances in each itera-
tion. For coping with this query selection problem, several
5.1. Active Learning for Event Type Classification strategies can be chosen based on informativeness and rep-
resentativeness (Huang et al., 2010).
Active learning is an iterative process to build classification
models by selecting small subsets of the available instances For informativeness as selection criteria, uncertainty sam-
to label. Two major steps are conducted: (1) a learning pling (Lewis & Catlett, 1994) is commonly applied that
step, where a classifier is built and (2) an improvement step, selects particularly those examples for labeling for which
in which the classifier is optimized. We follow a pool-based the learner is most uncertain. However, the main issue
sampling approach. First, a large number of microposts are with the informativeness approach is that only a single in-
collected as an initial pool of unlabeled data U . From this stance is considered at a time (Settles, 2012). Thus, out-
information base, a set of training examples L is chosen liers could be selected erroneously as the context is not
for learning an initial model. It is highly important how to taken into account. In contrary, clustering helps to identify
choose this set, because with a well-selected initial training representative instances. According to Nguyen and Smeul-
set, the learner can reach higher performance faster with ders (Nguyen & Smeulders, 2004), the most representative
fewer queries (Kang et al., 2004). 3
http://nlp.stanford.edu
4
For training a classifier using this initial set, we reuse the http://www.geonames.org/
5
classification approach presented in (Schulz et al., 2013b). http://developer.mapquest.com

9d
Event-based Clustering for Reducing Labeling Costs of Incident-Related Microposts

examples are those in the center of cluster, which are the Algorithm 1 Algorithm for initial selection strategy.
instances most similar to all other instances in the clus- Data: Unlabeled instances U , Clusters C generated by event-based clustering,
Size of initial training set bi
ter. Nevertheless, selecting always the centers of the clus- Result: Instances to label L
ters might result in selecting always very similar instances
for all clusters c ∈ C do
for each iteration, thus, the model might not improve very for all instances i ∈ c do
much. Furthermore, it remains unclear how many clusters Calculate information density DS(i)
end for
have to be built. Also, the resulting clusters not necessarily end for
correlate to the real-world events as spatial and temporal for all clusters c ∈ C do
Calculate average information density DSC(c)
information is neglected. end for
Order clusters in C based on DSC
To overcome the individual problems of each approach, re- while |L| ≤ bi do
lated work proposes to select the most informative and rep- for cluster c ∈ C do
Add one instance from c to L
resentative instances. This results in selecting the instances end for
that are representative for the whole dataset as well as have end while

the highest chance to improve the model. In our approach,
we use metadata provided in microposts to cluster instances
based on both criteria and to choose the most valuable in- serts that each incoming micropost of the event type Car
stances for training the classifier. The whole process of Crash is aggregated to a previously reported incident if it is
active learning continues until a stopping criteria is met, of the same type, within a range of 200 meters, and within a
e.g., a maximum number of iterations is reached or when time of 20 minutes. Clearly, altering the radius or the time
the model does not improve any more. will have a strong effect on the final clustering. However, as
emergency management experts suggested to use these val-
5.2. Event-based Clustering ues, we did not change them. Inspecting the effects of dif-
ferent parameterizations remains subject for future work,
Clustering-based approaches are frequently used for iden-
however, we are confident that our proposed approach is
tifying representative instances. However, there might not
not affected negatively by a change of these parameters.
be an obvious clustering of event-related data, thus, clus-
With the help of these three assertion types, a rule engine
tering might be performed at various levels of granularity
computes whether microposts are clustered as they describe
as the optimal number of cluster is unknown.
the same event or not.
Consequently, we use event-related information such as
Microposts containing no thematic information are as-
temporal and spatial information in combination with the
signed the unknown event type. Missing spatial informa-
event type to perform an event-based clustering to take the
tion is replaced with a common spatial center (the center
properties of real-world events into account. This way, we
of a city). Missing temporal information is replaced with
are directly able to find a number of clusters without the
the creation date of the micropost. Thus, even with one or
need of specifying the number beforehand. Furthermore,
two missing dimensions, we are still able to build clusters.
our event-based clustering is based on both selection crite-
Based on this clustering approach, we are able to cluster all
ria, so we overcome the limitations of each individual one.
microposts related to a specific event. This helps to identify
The design of our approach follows the assumption that those microposts that might be helpful for better training.
every event-related information is either related to a real- Opposed, microposts not related to events are assigned to
world event or not. Thus, we propose to cluster all in- larger clusters, containing lots of noise and being less valu-
stances based on the three dimensions that define an event: able for the learning process.
temporal, spatial and thematic extent. As a result, each in-
stance is aggregated to a cluster. 5.3. Initial Selection Strategy
If a micropost lies within the spatial, temporal, and The initial dataset that needs to be labeled is selected first.
thematic extent of another micropost, it is assumed Related approaches rely on random sampling or clustering
to provide information about the same event. This techniques (Zhu et al., 2008). However, this does not guar-
assertion can be formalized as a triple of the form antee the selection of appropriate instances, because the
{event type, radius, time}. The spatial extent is a radius initial sample size is rather small, whereas the size of the
in meters drawn around the spatial location of the event. clusters is large. In contrast, event-based clustering uses the
The temporal extent is a timespan in minutes calculated properties of real-world events to perform an initial cluster-
from the creation time of the initial event. The thematic ex- ing.
tent is the type of an event. For example, for our approach
we use the rule {Car Crash, 200m, 20min}, which as- Our approach for selecting the initial dataset is shown in
Algorithm 1. Based on the set of clusters resulting from our

93
Event-based Clustering for Reducing Labeling Costs of Incident-Related Microposts

event-based clustering, the most representative instances Algorithm 2 Algorithm for one iteration of the query se-
for the complete and unlabeled dataset are identified for lection strategy.
training the initial model. For this, we use the event clus- Data: Unlabeled instances U , Labeled instances L, Clusters C generated by
event-based clustering, Number of instances to label per iteration bi , Trained
ters ordered by information density of their containing in- Model for iteration M , Mean average size of all cluster in iteration ms
stances to obtain a good initial set. Selecting informative Result: Instances to label L
instances clearly is not possible yet, as a classifier cannot Use L to train classifier M
be trained at this point. In the following, we describe the for all clusters c ∈ C do
for all instances i ∈ c do
algorithm in detail. Calculate information density DS(i)
Calculate entropy H(i) using M
First, our clustering approach is applied on the complete Calculate density×entropy measure DSH(i)
unlabeled set U without a thematic specification as this is end for
end for
not present yet. Thus, the unknown event type is used. for all clusters c ∈ C do
Second, for all instances in each cluster the information Calculate DSHC(c)
end for
density is calculated. This is done based on how many in- Order clusters based on DSHC
stances are similar or near to each other, thus, outliers are while |L| ≤ bi do
for all clusters c ∈ C do
regarded as less valuable. We used a k-Nearest-Neighbor- n = logms (|c|)
based density estimation (Zhu et al., 2008): DS (x) = Add n instances from c to L
P end for
Similarity(x,s) end while
s∈S(x)
k

The density DS(x) of instance x is estimated based on
the k most similar instances in the same cluster6 S(x) = ∈ Y = {y1 , y2 , ..., yi } was employed: H(x) =
y P
{s1 , s2 , ..., sk }. As a similarity measure, we use the cosine − yǫY P (y|x) log P (y|x)
similarity between two instances. The information density
DSC of each cluster c is then calculated based on the aver- Based on the information density and the entropy, the
age of the information density of each instance as follows: density×entropy measure DSH(x) = DS(x) × H(x)
P
DS(x) (Zhu et al., 2008) is calculated for each instance x. The in-
DSC (c) = x∈c
k formativeness and representativeness of each cluster is then
Doing this, we are able to avoid noisy clusters with lots of computed based on the mean average ofPDSH of each in-
DSH(i)
unrelated items, which would typically be clusters not re- stance i in the cluster c: DSHC (c) = i∈c
|c|
lated to an event. Based on DSC(c) the clusters are sorted.
Then we iterate over the ordered list and select instances For selecting the appropriate instances to query, the clusters
until bi (initial training size) instances are selected. Pro- are sorted by the DSHC of each cluster. The number of in-
ceeding this way, we achieve a good distribution over all stances to draw per cluster is calculated as n = log(ms) CS.
valuable event clusters as it is guaranteed that the instances To determine how many instances have to be selected per
are selected from the most representative clusters. Based cluster (n), we calculate the average size of all clusters ms
on these instances, the initial model is build. and the size of the current cluster CS. We decided to use a
logarithmic scale by using a logarithm at basis ms to avoid
5.4. Query Selection Strategy drawing too many instances from larger clusters as would
be the case with a linear approach. We assume that draw-
For the query selection strategy we choose representative ing only small numbers per cluster is sufficient, as at some
using clustering as well as informative instances using point additional instances will not yield any additional in-
uncertainty-based sampling. The pseudo-code is shown in formation, as the instances will be too similar to each other.
Algorithm 2. In every iteration, the classifier trained on the Instances are selected until the number of instances to la-
currently labeled instances is applied to label all unlabeled bel per iteration is reached. Based on the previous and the
instances. As a result, every instance is assigned a thematic new instances the model is retrained. The whole process is
dimension. Based on this, the event clustering is applied repeated until all iterations are finished.
using the spatial, temporal, and thematic information re-
sulting in a set of clusters C.
6. Experiments
Next, for the query selection strategy, we calculate the in-
formation density DS per instance. For identifying in- We conducted two experiments regarding incident type
formative instances, we use the instances for which the classification. First, we compared related approaches
classifier is most uncertain. As an uncertainty measure to show that event-based clustering outperforms other
the entropy calculated for each instance x and each class clustering-based active learning approaches. Additionally,
the effect of the number of labeled instances on the clas-
6
k is equal to the number of instances in the cluster. sifier performance is examined. In the second experiment,

9N
Event-based Clustering for Reducing Labeling Costs of Incident-Related Microposts

a) 1 user without noise b) 1 user with noise c) 5 users with noise
90 90 90

85 85 85
F1 Score

F1 Score

F1 Score
80 80 80

75 75 75

70 70 70

65 65 65
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 0 100 200 300 400 500 600 700 800 900 1000 1100 1200
# of Instances # of Instances # of Instances
(a) ground truth data (b) one annotator (c) five annotators
e) 20 users with noise f) 50 users with noise h) 200 users with noise
90 90 90

85 85 85
F1 Score

F1 Score
F1 Score

80 80 80

75 75 75

70 70 70

65 65 65
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 0 100 200 300 400 500 600 700 800 900 1000 1100 1200

# of Instances # of Instances # of Instances
(d) 20 annotators (e) 50 annotators (f) 200 annotators
Tang et al. Uncertainty Sampling Event-based Clustering Zhu et al.

Figure 1. Evaluation results of state-of-the-art selection strategies and our approach. The graphs for different number of annotators
(regular crowd users) are shown. Note that the more annotators labeled an instance the lower is the probability for a noisy labeling.

the influence of noise is inspected. Noise results from a 6.2. Algorithms and Parameters
low number of non-expert labelers. For keeping costs low
In order to evaluate the performance of our approach, we
a classifier should not be affected by a bad labeling.
re-implemented the following related approaches:
6.1. Classification and Metrics (Tang et al., 2002): For initial sampling a k-means clus-
tering is used. For query selection, first the most uncertain
The active learning algorithms select instances from the instances for each cluster are selected. Then, information
training set to query for labels. Based on these, a classi- density is used to weight the examples. We set k = 4,
fier was trained and evaluated on the test set. As classifier because we have four different event types.
we used Weka’s implementation of John Platt’s sequential
minimal optimization (SMO) algorithm for training a sup- (Zhu et al., 2008): For initial sampling a k-means cluster-
port vector machine (Platt, 1998). Due to the complexity of ing is used (k = 4). During the iterations, the entropy ×
determining best parameters for each iteration and each ap- density measure is used as selection criteria and no cluster-
proach, we followed related approaches (see (Huang et al., ing is applied.
2010) and (Donmez et al., 2007)), and decided to compare Uncertainty: Random instances (initial) and the entropy-
all algorithms on fixed parameters. Consequently, the SVM based uncertainty sampling (iterations) is used.
was used with standard settings.
Event-based clustering: Our event-based clustering is ap-
For comparison, the deficiency metric (Raghavan et al., plied with a spatial extent of 200m and a temporal extent
2006) is calculated using the achieved F1 score of all it- of 20min.78
erations of a reference baseline algorithm (REF) and the
compared active learning approach (AL). The result is nor- Following the experimental settings of (Hu et al., 2013) and
malized using the largest F1 score and the learning curve (Huang et al., 2010), we set the size of the initial train-
of the reference algorithm REF. Thus, the measure is non- ing set and the size during the iterations to 50. No further
negative and values smaller than 1 indicate more efficient tuning or parameterization was applied. Each iteration for
algorithms compared to the baseline strategy, whereas a 7
As a result, the 1,200 tweets of the training set are divided
value larger than 1 indicates a performance decrease com- into 438 distinct event clusters.
pared to the baseline strategy. 8
The spatial and temporal extent are a result of discussions
with emergency managers.

8y
Event-based Clustering for Reducing Labeling Costs of Incident-Related Microposts

respective five annotators are shown. As can be seen in the
Table 2. Deficiencies with Tang et al. as a baseline strategy.
curve of the approach of Tang et al., the influence of noise is
Approach Deficiency notable in the big drop with 500 instances. Also Zhu et al.’s
(Tang et al., 2002) 1 approach has a much lower initial F1 score compared to
Uncertainty Sampling 0.53 all others, which is an indicator for an inappropriate initial
(Zhu et al., 2008) 0.90 selection strategy. The results indicate that even with noisy
Event-based Clustering 0.44 labels, our approach outperforms the state-of-the-art as the
situation in the graphs of the lower part of Figure 1 does
each algorithm was repeated 10 times, as for instance, the not change much. In all these cases, the learning curves are
uncertainty approach is highly dependent on the selected quite similar, which is a result of the decreased number of
instances. We used averaged F1 based on the repetitions. wrongly labeled instances. Clearly, the performance of all
approaches increases with a lower number of errors.
6.3. Comparison to state-of-the-art approaches
As we showed, our approach outperforms related work also
The performance graph for the ground truth data is shown if noise is taken into account. Not surprisingly, we found
in Figure 1 (a). Note that the x-axis shows the total num- that with an increasing number of annotators, noise is neg-
ber of instances combining the 50 instances of the initial ligible. With only one annotator, the deficiency is worse
training set and 50 instances drawn per iteration. Thus, it- by 57% and with five annotators still worse by 26%. Even
erations from 0 to 23 are depicted. As shown, the perfor- with 50 annotators, the deficiency still is worse by 10%.
mance after selecting the initial training set is superior with For more than ten annotators, an F1 score of 85% is reached
our approach. Also, in regions where only a few instances comparably fast. With a maximum of five annotators, this
were labeled, the event-based clustering has a higher F1 level is only reached at the end of the simulation. For one
value. This shows that a high-quality selection of the itera- annotator, this maximum is never achieved. These results
tion instances is possible with our method. indicate that a minimum number of annotators is needed for
achieving good results by crowdsourced labeling tasks. In
Table 2 shows the deficiencies. With respect to the perfor- our experiments, ten annotators seem to be sufficient, while
mance of the iterations, our approach has a decreased de- in other domains with different error rates, there might be
ficiency compared to other clustering approaches (0.44 vs. a need for much more annotators.
0.53). The approach of Zhu et al. outperforms the approach
of Tang et al. in most iterations and also with respect to the
deficiency. We attribute this to the improved strategy for 7. Conclusion
query selection. A surprising result is the performance of We presented an event-based clustering strategy for event
uncertainty sampling that outperforms the other two clus- type classification of microposts and coped with several
tering strategies. Apparently, only focusing on the infor- problems of active learning in the emergency management
mativeness seems to be a good strategy for our dataset. In domain. First, it was shown that domain experts do not dif-
contrast, using the number of distinct events as the number fer significantly from regular crowd users when it comes
of clusters might not be the most efficient approach. to labeling quality. Second, we presented a novel selection
The graph also shows that our approach has a steep learn- strategy for active learning based on temporal, spatial, and
ing curve as for instance only a sixth of all instances are thematic information. Our event-based clustering that iden-
needed to achieve a F1 score of about 84%. This is espe- tifies representative as well as informative instances outper-
cially important when it comes to labeling costs, as only a forms state-of-the-art clustering approaches. On incident-
limited amount of data would need to be labeled. One can related microposts we showed that a better initial training
see a drop at 500 instances. This is most likely because set is selected as well as to appropriate instances for la-
with more instances the number of clusters is decreasing, beling in each iteration are chosen. The learning curve
thus, selecting appropriate instances is more difficult. indicated that only a sixth of all instances are needed to
achieve a F1 score of about 84%, which is especially im-
We can conclude that event-based clustering that takes rep- portant when it comes to labeling costs, as only a limited
resentative as well as informative instances into account is amount of data would need to be labeled to achieve good
a promising strategy for active learning. We also showed classification results.
that our approach outperforms state-of-the-art for selecting
an initial training set and for choosing appropriate instances In the future, we aim at using our active learning frame-
for labeling in each iteration. work in addition to labeling of single features. Further-
more, though our framework follows a general approach,
Influence of noise in the labels In Figure 1 (b) and 1 (c), we only evaluated it on incident-related data, thus, we also
the learning curves for the very error-prone cases with one want so show the applicability on other types of events.

8R
Event-based Clustering for Reducing Labeling Costs of Incident-Related Microposts

References Schulz, Axel, Ristoski, Petar, and Paulheim, Heiko. I see
a car crash: Real-time detection of small scale incidents
Donmez, Pinar and Carbonell, Jaime G. Proactive learning:
in microblogs. In The Semantic Web: ESWC 2013 Satel-
cost-sensitive active learning with multiple imperfect or-
lite Events, volume 7955 of Lecture Notes in Computer
acles. In CIKM’08, pp. 619–628, 2008.
Science, pp. 22–33. Springer Berlin Heidelberg, 2013b.
Donmez, Pinar, Carbonell, Jaime G., and Bennett, Paul N. Schulz, Axel, Guckelsberger, Christian, and Janssen, Fred-
Dual strategy active learning. In ECML’07, pp. 116–127, erik. Semantic abstraction for generalization of tweet
2007. classification: An evaluation on incident-related tweets.
In Semantic Web Journal: Special Issue on The Role of
Hoi, Steven C. H., Jin, Rong, and Lyu, Michael R. Large-
Semantics in Smart Cities (to appear). IOS Press, 2015.
scale text categorization by batch mode active learning.
In WWW’06, pp. 633–642, 2006. Settles, Burr. Active Learning. Synthesis Lectures on Ar-
tificial Intelligence and Machine Learning. Morgan &
Hsueh, Pei-Yun, Melville, Prem, and Sindhwani, Vikas. Claypool Publishers, 2012.
Data quality from crowdsourcing: A study of annotation
selection criteria. In NAACL HLT’09, pp. 27–35, 2009. Shen, Dan, Zhang, Jie, Su, Jian, Zhou, Guodong, and
Tan, Chew-Lim. Multi-criteria-based active learning for
Hu, Xia, Tang, Jiliang, Gao, Huiji, and Liu, Huan. Actnet: named entity recognition. In ACL’04, 2004.
Active learning for networked texts in microblogging. In
SIAM’13, 2013. Tang, Min, Luo, Xiaoqiang, and Roukos, Salim. Ac-
tive learning for statistical natural language parsing. In
Huang, Sheng-Jun, Jin, Rong, and Zhou, Zhi-Hua. Active ACL’02, pp. 120–127, 2002.
learning by querying informative and representative ex-
amples. In NIPS, pp. 892–900, 2010. Thongsuk, Chanattha, Haruechaiyasak, Choochart, and
Meesad, Phayung. Classifying business types from twit-
Kang, Jaeho, Ryu, Kwang R., and Kwon, Hyuk C. Using ter posts using active learning. In I2CS’10, pp. 180–189,
Cluster-Based Sampling to Select Initial Training Set for 2010.
Active Learning in Text Classification. In PAKDD’04,
Tong, Simon and Koller, Daphne. Support vector machine
pp. 384–388, 2004.
active learning with applications to text classification. J.
Lewis, David D. and Catlett, Jason. Heterogeneous uncer- Mach. Learn. Res., 2:45–66, 2002.
tainty sampling for supervised learning. In ICML’94, pp. Wallace, Byron C., Small, Kevin, Brodley, Carla E., and
148–156, 1994. Trikalinos, Thomas A. Who should label what? instance
Nguyen, Hieu T. and Smeulders, Arnold. Active learning allocation in multiple expert active learning. In SDM’11,
using pre-clustering. In ICML’04, pp. 79–86, 2004. 2011.

Zhao, Liyue, Sukthankar, G., and Sukthankar, R. In-
Platt, J. Fast training of support vector machines using
cremental Relabeling for Active Learning with Noisy
sequential minimal optimization. In Schoelkopf, B.,
Crowdsourced Annotations. In SocialCom’11, pp. 728–
Burges, C., and Smola, A. (eds.), Advances in Kernel
733, 2011.
Methods - Support Vector Learning. MIT Press, 1998.
Zhu, Jingbo, Wang, Huizhen, Yao, Tianshun, and Tsou,
Raghavan, Hema et al. Active learning with feedback on Benjamin K. Active learning with sampling by uncer-
both features and instances. J. of Machine Learning Re- tainty and density for word sense disambiguation and
search, 7:1655–1686, 2006. text classification. In COLING’08, pp. 1137–1144,
2008.
Schulz, Axel. Mining User-Generated Content for Inci-
dents. PhD thesis, TU Darmstadt, 2014. URL http:
//tuprints.ulb.tu-darmstadt.de/4107/.

Schulz, Axel, Hadjakos, Aristotelis, Paulheim, Heiko,
Nachtwey, Johannes, and Mühlhäuser, Max. A multi-
indicator approach for geolocalization of tweets. In Pro-
ceedings of the Eight International Conference on We-
blogs and Social Media (ICWSM), pp. 573–582, Menlo
Park, California, USA, 2013a. AAAI Press.