=Paper=
{{Paper
|id=Vol-1392/paper-06
|storemode=property
|title=Event-based Clustering for Reducing Labeling Costs of Incident-Related Microposts
|pdfUrl=https://ceur-ws.org/Vol-1392/paper-06.pdf
|volume=Vol-1392
|dblpUrl=https://dblp.org/rec/conf/icml/SchulzRFJ15
}}
==Event-based Clustering for Reducing Labeling Costs of Incident-Related Microposts==
Axel Schulz (schulz.axel@gmx.net), DB Mobility Logistics AG, Germany, and Telecooperation Lab, Technische Universität Darmstadt, Germany
Petar Ristoski (petar.ristoski@informatik.uni-mannheim.de), Data and Web Science Group, University of Mannheim, Germany
Johannes Fürnkranz (juffi@ke.informatik.tu-darmstadt.de), Knowledge Engineering Group, Technische Universität Darmstadt, Germany
Frederik Janssen (janssen@ke.tu-darmstadt.de), Knowledge Engineering Group, Technische Universität Darmstadt, Germany

Proceedings of the 2nd International Workshop on Mining Urban Data, Lille, France, 2015. Copyright 2015 for this paper by its authors. Copying permitted for private and academic purposes.

Abstract

Automatically identifying the event type of event-related information in the sheer amount of social media data makes machine learning inevitable. However, this is highly dependent on (1) the number of correctly labeled instances and (2) labeling costs. Active learning has been proposed to reduce the number of instances to label. However, current approaches focus on the thematic dimension, i.e., the event type, for selecting instances to label; other metadata such as spatial and temporal information that is helpful for achieving a more fine-grained clustering is currently not taken into account. Also, labeling quality is always assumed to be perfect, as currently no qualitative information is available for manual event type labeling.

In this paper, we present a novel event-based clustering strategy that makes use of temporal, spatial, and thematic metadata to determine instances to label. Furthermore, we also inspect the quality of the manual labeling in a crowdsourcing study by comparing experts and non-experts. An evaluation on incident-related tweets shows that (i) labels provided by crowdsourcing are of acceptable quality and (ii) our selection strategy for active learning outperforms current state-of-the-art approaches even with few labeled instances.

1. Introduction

Detecting event-related information in microposts has shown its value for a variety of domains. Especially in emergency management, different situational information is present that could contribute to understanding the situation at hand (Schulz, 2014). However, solving the actual problem of classifying the incident type in this domain requires labeled data. One of the main problems with microposts is acquiring ground truth for utilizing supervised learning. Thus, we deal with two major issues: (1) the costs for labeling a single instance, and (2) the number of instances to label.

On the one hand, to actually build a classifier that is able to accurately predict the type of the incident mentioned in a tweet, usually experts are deployed for labeling, as they have enough domain knowledge to create ground truth. However, as often several hundreds of examples have to be labeled until the classifier is able to reach sufficient quality, relying on experts for labeling is not always possible and it is costly. In contrast, labels can also be derived from non-experts, i.e., by making use of crowdsourcing. Given that the labels obtained in this way are of sufficient quality, the costs for such a process would be acceptable, as crowdsourcing is rather cheap. But up to now there is no information about labeling quality for incident-related tweets. Hence, we first proceeded by comparing the labeling quality of experts and non-experts.

On the other hand, the number of instances to label has to be kept as low as possible.
Due to the huge number of tweets, labeling all instances is not possible, as even with cheap labeling the costs would explode. Keeping the number of instances to label low while maintaining accurate classifiers is a typical active learning problem (Settles, 2012). Here, labeling costs are reduced by iteratively (1) selecting small subsets of instances to query for labels and (2) re-training a classifier with the newly labeled data. Thus, in general, but also specifically for classifying microposts, there are two issues to solve, namely selecting a good initial training set and the right instances in each iteration.

For selecting appropriate instances, several selection strategies have been proposed based on two criteria, informativeness and representativeness (Huang et al., 2010). Informativeness measures the usefulness of an instance for reducing the uncertainty of the model, whereas representativeness measures how well an instance represents the overall input of unlabeled data. The latter is usually addressed by employing clustering approaches, from which the representative instances of each cluster are drawn. However, for event-type classification the number of clusters to build is not known in advance, as it is unknown how often an event occurred. Hence, it is most often set to the number of distinct event types, which obviously is not appropriate. For instance, one event might be a tiny fire in a waste bin whereas another is a huge fire in a factory; though microposts for both events need to be classified with the "fire" event type, state-of-the-art approaches would not distinguish these two events and thus could not yield an optimal selection of instances to label. For better distinguishing events, a straightforward approach is to characterize an event not only by its type, but also by spatial and temporal information. Proceeding this way, the two example events are inherently assigned to different clusters and hence instances to be labeled are drawn from both of them.

Consequently, we contribute an event-based clustering approach that also leverages the temporal and spatial dimensions of tweets to allow a more fine-grained clustering. Due to the smaller clusters, the selection of appropriate instances is easier, because one can assume that even with a bad sampling the selected instances will still be of high quality. The evaluation on incident-related tweets shows that this enhanced clustering indeed improves the selection compared to state-of-the-art approaches. It is also shown that our approach has a good performance even when only few examples are labeled.

In summary, the contributions of this paper are: (1) a study comparing the labeling quality of experts and non-experts, showing no significant difference in error rates; (2) a novel event-based clustering approach that makes use of spatial, temporal, and thematic information present in microposts, where the clustering benefits strongly from these additional dimensions; (3) a comparison of our approach using different numbers of annotators and different levels of noise, showing that even with a classifier that was not explicitly built to be robust, noise does not hinder the classifier much.

We begin with summarizing related approaches. Next, we show how the ground truth data was developed. Then we summarize the results of the study on crowdsourced labels (Section 4), followed by a description of the event-based clustering for active learning. Afterwards, the results are shown and discussed (Section 6) and the paper is concluded.
Pro- to select informative as well as representative instances. ceeding this way, the two example events are inherently (Tang et al., 2002) used k-means clustering and proposed to assigned to different clusters and hence instances to be la- select the most uncertain instance for each cluster. Informa- beled are drawn from both of them. tion density was then used to weight instances. (Shen et al., Consequently, we contribute an event-based clustering ap- 2004) applied k-means clustering and uncertainty sampling proach that also leverages the temporal and spatial dimen- and used the information density calculated within a clus- sion of tweets to allow a more fine-grained clustering. Due ter. (Donmez et al., 2007) combined uncertainty sampling to smaller clusters the selection of appropriate instances is and k-Medoid to identify representative as well as informa- easier because one can assume that even with a bad sam- tive instances and showed that this combination is indeed pling the selected instances will still be of high quality. beneficial. The evaluation on incident-related tweets shows that this The approach of (Zhu et al., 2008) is the most advanced re- enhanced clustering indeed improves the selection com- lated approach when it comes to combining representative- pared to state-of-the-art approaches. It is also shown that ness and informativeness, thus, we used it as a foundation our approach has a good performance even when only few for our technique. The authors employed clustering for the examples are labeled. initial selection. Uncertainty sampling is combined with In summary, the contributions of this paper are: (1) A study estimating a density for each iteration. Unlike their work, comparing the labeling quality of experts and non-experts we apply our event-based clustering also for the iterations. showing no significant difference of error rates. (2) A novel (Huang et al., 2010) followed a similar approach. Instances event-based clustering approach that makes use of spatial, are selected based on clustering and on confidence in pre- temporal, and thematic information present in microposts. dicting a class label as informativeness measure. Though The clustering benefits strongly from these additional di- their approach is quite promising, the authors stated that it mensions. (3) A comparison of our approach using differ- is restricted to binary classification, whereas we are able to ent number of annotators and different levels of noise. Even classify multiple classes. with a classifier that was not explicitly build to be robust, Taking labeling quality into account is still open to re- noise does not hinder the classifier much. search. Up to now, there is no study of labeling quality 98 Event-based Clustering for Reducing Labeling Costs of Incident-Related Microposts of event-related tweets, but only studies on structured texts Table 1. Results for the random error evaluated in a study on qual- such as the work of (Hsueh et al., 2009). Since 2008, the ity of crowdsourced labels. Means (µ) and standard deviation active learning community also tackled the problem of dif- (SD) of the error rates are displayed for each user group. ferent reliabilities of oracles (Donmez & Carbonell, 2008; Random Error Zhao et al., 2011; Wallace et al., 2011). These approaches Crowd Expert have been proposed to take labeling uncertainty into ac- µ 0.0338 0.0323 count and show that repeated re-labeling of wrongly la- SD 0.0006 0.0002 beled tweets could improve label quality and model quality. 
Nevertheless, most often synthetic error rates have been assumed.

To sum up, some works have tried to combine informativeness and representativeness for selecting instances and showed promising results. Nevertheless, none of these approaches has been evaluated on microposts or has taken event-related metadata into account. Also, no information about real-world error rates is available or has been used in active learning.

3. Developing Ground Truth Data

In this section, we present the dataset used for our evaluation. We focus on incident-related tweets as a specific type of event-related data. We differentiate between three incident types in order to classify microposts. These have been chosen because we identified them as the most common incident types in the Seattle Fire Calls dataset (http://data.seattle.gov), which is a frequently updated source of official incident information. We also add one neutral class; thus, our final classes are: car crash, fire, shooting, and no incident.

As there are no publicly available labeled datasets for event-related microposts, we needed to create our own high-quality ground truth data. For this, we collected English microposts using the Twitter Search API. For the collection, we used a 15km radius around the city centers of Seattle, WA and Memphis, TN. We focused on only two cities, as for our analysis we were interested in a large stream of tweets for a specific time period in certain areas instead of a world-wide scattered sample. This gave us a set of 7.5M microposts from Nov. 19th, 2012 until Feb. 7th, 2013. Although the datasets have been collected in different time periods, we do not expect any difference in the way people post about incidents.

As this initial set was used for conducting our experiments, we had to further reduce the size of the datasets, following our approach as described in (Schulz et al., 2013b). The resulting 2,000 tweets were manually labeled by four domain experts using an online survey. To assign the final coding, at least three coders had to agree on a label. Instances without an agreement were further examined and relabeled during a group discussion. The final dataset consists of 328 fire, 309 crash, 334 shooting, and 1,029 not incident-related tweets (all datasets will be published at http://www.doc.gold.ac.uk/~cguck001/IncidentTweets/). For our evaluation, we used 1,200 tweets for training and 800 tweets for testing (temporal split, i.e., the testing instances are later in time than the training instances). Though this selection might seem arbitrary, all compared algorithms rely on the same sampling, thus allowing for a fair comparison.
4. Study on Quality of Crowdsourced Labels

In active learning, a perfect oracle is most often assumed for labeling instances. As this might not hold true in a real-world environment, we conducted a study on labeling accuracy. When it comes to labeling accuracy, the general assumption is that labeling quality in crowdsourcing environments might depend on the domain knowledge of the annotators (Zhao et al., 2011). Thus, one goal of the study is to analyze whether the labeling quality of non-experts differs significantly from that of domain experts. To answer this question, we evaluated two user groups in our study: domain experts and regular crowd users with no or limited domain knowledge. Second, there is no work describing error rates for the labeling of incident-related microposts. Thus, we want to quantify the error rates so that we can use them for our simulations. For this, we evaluated the random error, i.e., the error that results from annotator carelessness, e.g., when a wrong label is occasionally assigned. The random error is regarded as i.i.d. noise on each label; thus, we assume a fixed probability RE ∈ [0, 1].

We assume a different labeling quality for crowd users (CU) and domain experts (EX) and test the following hypothesis H: the means (µ) of the random error are different across both user groups (H0: µ_RE,CU = µ_RE,EX, HA: µ_RE,CU ≠ µ_RE,EX).

We created a survey to conduct the labeling of our complete ground truth dataset according to the incident types. Fourteen users participated in the study. Eight participants were crowd users with no or low experience in the crisis management domain and six users were domain experts with more than three years of experience in the domain. At least three crowd users and at least two domain experts labeled each tweet. Based on the results, we calculate the random error compared to the ground truth labels (cf. Table 1).

Table 1. Results for the random error evaluated in a study on quality of crowdsourced labels. Means (µ) and standard deviations (SD) of the error rates are displayed for each user group.

  Random Error    Crowd     Expert
  µ               0.0338    0.0323
  SD              0.0006    0.0002

For evaluating our hypothesis, we first confirmed normal distribution for all error types and both user groups using the Anderson-Darling as well as the Shapiro-Wilk normality test. Furthermore, we conducted a two-sample F-test for variances to verify equal variances for all combinations with p < 0.01. For each combination we then conducted the two-sample t-test assuming equal variances. For all combinations the null hypotheses could not be rejected with p < 0.01. Thus, for all error types, we cannot assume a difference between both user groups. This means that in our study there is no conceivable difference between domain experts and common crowd users.
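To make the testing procedure concrete, the following is a minimal sketch of this pipeline using SciPy. The per-annotator error rates are synthetic stand-ins drawn to match the reported means and standard deviations, not the study's raw data, and the variable names are illustrative.

```python
# A minimal sketch of the reported testing pipeline (normality check,
# F-test for equal variances, two-sample t-test); the per-annotator error
# rates below are synthetic stand-ins, not the study's raw data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
crowd = rng.normal(0.0338, 0.0006, size=8)    # eight crowd users
expert = rng.normal(0.0323, 0.0002, size=6)   # six domain experts

# 1) Normality of each sample (the paper additionally uses Anderson-Darling).
print(stats.shapiro(crowd).pvalue, stats.shapiro(expert).pvalue)

# 2) Two-sample F-test for equal variances (not built into SciPy directly).
f = np.var(crowd, ddof=1) / np.var(expert, ddof=1)
df1, df2 = len(crowd) - 1, len(expert) - 1
p_f = 2 * min(stats.f.sf(f, df1, df2), stats.f.cdf(f, df1, df2))

# 3) Two-sample t-test assuming equal variances; H0 (equal means) is
#    retained when the p-value exceeds the chosen significance level.
t, p_t = stats.ttest_ind(crowd, expert, equal_var=True)
print(f"F-test p={p_f:.4f}, t-test p={p_t:.4f}")
```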
One reason might be the rather low sample size. Others might be found in the nature of microposts: as they are short, the amount of available information per tweet is limited. Thus, the complexity of the information is low and it is possible to understand the content even as a non-expert. Furthermore, as tweets are sent by lots of different individuals, the number of domain-specific terms could be rather low compared to specialized texts. Also, as incident-related tweets cover common topics compared to, e.g., physics or medicine, people are somewhat used to the vocabulary.

To best reflect a real-world situation, we combined the results of both groups, as in typical crowdsourcing studies both groups might be present. Also, the labels of the experts are available anyway for our dataset. This gave us a final error rate of 0.0331 for the random error.

5. Event-Based Clustering

In this section, we show how active learning can be utilized to classify the incident type of microposts. We also introduce our approach and present how we cope with the initial selection problem, i.e., how to select the initial training set, as well as with the query selection problem, i.e., how to choose appropriate instances for labeling in each iteration.

5.1. Active Learning for Event Type Classification

Active learning is an iterative process to build classification models by selecting small subsets of the available instances to label. Two major steps are conducted: (1) a learning step, where a classifier is built, and (2) an improvement step, in which the classifier is optimized. We follow a pool-based sampling approach. First, a large number of microposts is collected as an initial pool of unlabeled data U. From this information base, a set of training examples L is chosen for learning an initial model. It is highly important how this set is chosen, because with a well-selected initial training set, the learner can reach a higher performance faster with fewer queries (Kang et al., 2004).

For training a classifier using this initial set, we reuse the classification approach presented in (Schulz et al., 2013b). Here, microposts are processed with standard Natural Language Processing (NLP) techniques such as stopword removal, POS-tagging, and lemmatization. Afterwards, several features are extracted from the preprocessed instances, such as word-3-grams after POS-filtering, TF-IDF scores, and syntactic as well as semantic features. The syntactic features are the number of exclamation and question marks as well as the number of upper-case characters. The semantic features are a feature group derived using different means of Semantic Abstraction (Schulz et al., 2015). Furthermore, the existing approach allows us to extract a likely date of an event mentioned in a micropost. To identify the temporal information in a tweet, we adapted the HeidelTime framework for temporal extraction as presented in (Schulz et al., 2013b).

As the number of geotagged microposts is rather low (about 1-2%), we reuse an extension of our approach for geolocalization of microposts (Schulz et al., 2013a) as well as for extracting location mentions as features used in the classification. For geolocalization, an estimation of the city and the country where a tweet was sent from is used, and additionally location mentions extracted from the tweet message are considered. First, we use a Stanford NER model (http://nlp.stanford.edu) to identify all location mentions. Then, the discovered locations are geocoded using the geographical database GeoNames (http://www.geonames.org/), and the MapQuest Nominatim API (http://developer.mapquest.com) for more fine-grained locations, like streets. The intersection of all locations extracted from the tweet is used as an estimation of the location where an event mentioned in the tweet has happened.
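As an illustration of this intersection step, here is a minimal sketch under simplifying assumptions: candidate areas (the city/country estimate and each geocoded location mention) are modeled as plain bounding boxes, and all names are hypothetical, not part of the cited geolocalization pipeline.

```python
# A minimal sketch of the intersection step for geolocalization; candidate
# areas (city/country estimate, geocoded location mentions) are modeled as
# bounding boxes. Helper names are hypothetical illustrations only.
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (min_lat, min_lon, max_lat, max_lon)

def intersect(a: Box, b: Box) -> Optional[Box]:
    """Intersection of two bounding boxes, or None if they are disjoint."""
    c = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return c if c[0] <= c[2] and c[1] <= c[3] else None

def estimate_event_location(candidates: List[Box]) -> Optional[Box]:
    """Narrow the event location down by intersecting all candidate areas;
    contradictory candidates that would empty the estimate are skipped."""
    estimate: Optional[Box] = None
    for box in candidates:
        if estimate is None:
            estimate = box
            continue
        narrowed = intersect(estimate, box)
        if narrowed is not None:
            estimate = narrowed
    return estimate
```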
After the initial training, the classifier is retrained in several iterations using newly labeled instances. After each iteration, the labeled instances are removed from the pool of unlabeled instances U and added to L; thus, more instances can be used for learning. A selection strategy is used on U to query labels for a number of instances in each iteration. For coping with this query selection problem, several strategies can be chosen based on informativeness and representativeness (Huang et al., 2010).

With informativeness as selection criterion, uncertainty sampling (Lewis & Catlett, 1994) is commonly applied, which selects particularly those examples for labeling for which the learner is most uncertain. However, the main issue with the informativeness approach is that only a single instance is considered at a time (Settles, 2012). Thus, outliers could be selected erroneously, as the context is not taken into account. In contrast, clustering helps to identify representative instances. According to Nguyen and Smeulders (2004), the most representative examples are those in the center of a cluster, which are the instances most similar to all other instances in the cluster. Nevertheless, always selecting the centers of the clusters might result in selecting very similar instances in each iteration; thus, the model might not improve very much. Furthermore, it remains unclear how many clusters have to be built. Also, the resulting clusters do not necessarily correspond to the real-world events, as spatial and temporal information is neglected.

To overcome the individual problems of each approach, related work proposes to select the most informative and representative instances. This results in selecting the instances that are representative for the whole dataset as well as have the highest chance to improve the model. In our approach, we use metadata provided in microposts to cluster instances based on both criteria and to choose the most valuable instances for training the classifier. The whole process of active learning continues until a stopping criterion is met, e.g., a maximum number of iterations is reached or the model does not improve any more.

5.2. Event-based Clustering

Clustering-based approaches are frequently used for identifying representative instances. However, there might not be an obvious clustering of event-related data; thus, clustering might be performed at various levels of granularity, as the optimal number of clusters is unknown.

Consequently, we use event-related information such as temporal and spatial information in combination with the event type to perform an event-based clustering that takes the properties of real-world events into account. This way, we are directly able to find a number of clusters without the need of specifying the number beforehand. Furthermore, our event-based clustering is based on both selection criteria, so we overcome the limitations of each individual one.

The design of our approach follows the assumption that every piece of event-related information is either related to a real-world event or not. Thus, we propose to cluster all instances based on the three dimensions that define an event: temporal, spatial, and thematic extent. As a result, each instance is aggregated to a cluster.

If a micropost lies within the spatial, temporal, and thematic extent of another micropost, it is assumed to provide information about the same event. This assertion can be formalized as a triple of the form {event type, radius, time}. The spatial extent is a radius in meters drawn around the spatial location of the event. The temporal extent is a timespan in minutes calculated from the creation time of the initial event. The thematic extent is the type of an event. For example, for our approach we use the rule {Car Crash, 200m, 20min}, which asserts that each incoming micropost of the event type Car Crash is aggregated to a previously reported incident if it is of the same type, within a range of 200 meters, and within a time of 20 minutes. Clearly, altering the radius or the time will have a strong effect on the final clustering. However, as emergency management experts suggested these values, we did not change them. Inspecting the effects of different parameterizations remains subject to future work; however, we are confident that our proposed approach is not affected negatively by a change of these parameters. With the help of these three assertion types, a rule engine computes whether microposts are clustered together, i.e., whether they describe the same event or not.

Microposts containing no thematic information are assigned the unknown event type. Missing spatial information is replaced with a common spatial center (the center of a city). Missing temporal information is replaced with the creation date of the micropost. Thus, even with one or two missing dimensions, we are still able to build clusters.
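The following is a minimal Python sketch of this aggregation rule, assuming each micropost already carries its (possibly imputed) event type, coordinates, and timestamp; the class and function names are illustrative and not the paper's actual rule engine.

```python
# A minimal sketch of the event-based clustering rule: a micropost joins an
# existing event cluster if it matches in type, lies within the spatial
# radius, and falls within the temporal extent, e.g. {Car Crash, 200m, 20min}.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from math import radians, sin, cos, asin, sqrt
from typing import List

@dataclass
class Micropost:
    event_type: str          # "unknown" if no thematic information
    lat: float               # city center if no spatial information
    lon: float
    created: datetime        # creation date if no temporal information

@dataclass
class EventCluster:
    event_type: str
    lat: float
    lon: float
    start: datetime
    posts: List[Micropost] = field(default_factory=list)

def haversine_m(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance between two points in meters."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 6371000 * 2 * asin(sqrt(a))

def assign(post: Micropost, clusters: List[EventCluster],
           radius_m: float = 200, minutes: int = 20) -> EventCluster:
    """Aggregate the post to a matching event cluster or open a new one."""
    for c in clusters:
        if (c.event_type == post.event_type
                and haversine_m(c.lat, c.lon, post.lat, post.lon) <= radius_m
                and abs(post.created - c.start) <= timedelta(minutes=minutes)):
            c.posts.append(post)
            return c
    new = EventCluster(post.event_type, post.lat, post.lon, post.created, [post])
    clusters.append(new)
    return new
```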
Based on this clustering approach, we are able to cluster all microposts related to a specific event. This helps to identify those microposts that might be helpful for better training. In contrast, microposts not related to events are assigned to larger clusters, which contain lots of noise and are less valuable for the learning process.

5.3. Initial Selection Strategy

The initial dataset that needs to be labeled is selected first. Related approaches rely on random sampling or clustering techniques (Zhu et al., 2008). However, this does not guarantee the selection of appropriate instances, because the initial sample size is rather small, whereas the size of the clusters is large. In contrast, event-based clustering uses the properties of real-world events to perform an initial clustering.

Our approach for selecting the initial dataset is shown in Algorithm 1. Based on the set of clusters resulting from our event-based clustering, the most representative instances for the complete and unlabeled dataset are identified for training the initial model. For this, we use the event clusters ordered by the information density of the instances they contain to obtain a good initial set. Selecting informative instances is clearly not possible yet, as a classifier cannot be trained at this point. In the following, we describe the algorithm in detail.

Algorithm 1: Initial selection strategy.
  Data: Unlabeled instances U, clusters C generated by event-based clustering, size of initial training set b_i
  Result: Instances to label L
  for all clusters c ∈ C do
    for all instances i ∈ c do
      calculate information density DS(i)
    end for
  end for
  for all clusters c ∈ C do
    calculate average information density DSC(c)
  end for
  order clusters in C based on DSC
  while |L| ≤ b_i do
    for all clusters c ∈ C do
      add one instance from c to L
    end for
  end while

First, our clustering approach is applied on the complete unlabeled set U without a thematic specification, as this is not present yet. Thus, the unknown event type is used. Second, for all instances in each cluster the information density is calculated. This is done based on how many instances are similar or near to each other; thus, outliers are regarded as less valuable.
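To make Algorithm 1 concrete, here is a compact Python sketch under simplifying assumptions: instances are rows of per-cluster feature matrices, the density DS is the kNN formulation given just below, and drawing the densest not-yet-selected instance per cluster and round is one plausible reading of "add one instance from c", which the pseudocode leaves open.

```python
# A minimal sketch of Algorithm 1 (initial selection); clusters are a list
# of feature matrices of shape (cluster_size, n_features).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def information_density(cluster: np.ndarray) -> np.ndarray:
    """DS(x): mean cosine similarity of x to the other instances in its
    cluster (k equals the cluster size, as in the paper's footnote)."""
    sims = cosine_similarity(cluster)   # pairwise similarities
    np.fill_diagonal(sims, 0.0)         # exclude self-similarity
    return sims.sum(axis=1) / len(cluster)

def initial_selection(clusters: list, b_init: int) -> list:
    """Order clusters by average density DSC, then repeatedly draw one
    instance per cluster until b_init (cluster, instance) pairs are picked."""
    densities = [information_density(c) for c in clusters]
    order = sorted(range(len(clusters)), key=lambda ci: -densities[ci].mean())
    ranked = {ci: np.argsort(-densities[ci]) for ci in order}
    picked = []
    for round_ in range(max(len(c) for c in clusters)):
        for ci in order:
            if round_ < len(ranked[ci]):
                picked.append((ci, int(ranked[ci][round_])))
                if len(picked) == b_init:
                    return picked
    return picked
```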
We used a k-Nearest-Neighbor-based density estimation (Zhu et al., 2008):

  DS(x) = (Σ_{s ∈ S(x)} Similarity(x, s)) / k

The density DS(x) of instance x is estimated based on the k most similar instances in the same cluster, S(x) = {s₁, s₂, ..., s_k}, where k is equal to the number of instances in the cluster. As similarity measure, we use the cosine similarity between two instances. The information density DSC of each cluster c is then calculated as the average of the information densities of its instances:

  DSC(c) = (Σ_{x ∈ c} DS(x)) / k

Doing this, we are able to avoid noisy clusters with lots of unrelated items, which would typically be clusters not related to an event. Based on DSC(c) the clusters are sorted. Then we iterate over the ordered list and select instances until b_i (initial training size) instances are selected. Proceeding this way, we achieve a good distribution over all valuable event clusters, as it is guaranteed that the instances are selected from the most representative clusters. Based on these instances, the initial model is built.

5.4. Query Selection Strategy

For the query selection strategy, we choose representative instances using clustering as well as informative instances using uncertainty-based sampling. The pseudo-code is shown in Algorithm 2. In every iteration, the classifier trained on the currently labeled instances is applied to label all unlabeled instances. As a result, every instance is assigned a thematic dimension. Based on this, the event clustering is applied using the spatial, temporal, and thematic information, resulting in a set of clusters C.

Algorithm 2: One iteration of the query selection strategy.
  Data: Unlabeled instances U, labeled instances L, clusters C generated by event-based clustering, number of instances to label per iteration b_i, trained model for iteration M, mean average size of all clusters in iteration ms
  Result: Instances to label L
  use L to train classifier M
  for all clusters c ∈ C do
    for all instances i ∈ c do
      calculate information density DS(i)
      calculate entropy H(i) using M
      calculate density×entropy measure DSH(i)
    end for
  end for
  for all clusters c ∈ C do
    calculate DSHC(c)
  end for
  order clusters based on DSHC
  while |L| ≤ b_i do
    for all clusters c ∈ C do
      n = log_ms(|c|)
      add n instances from c to L
    end for
  end while

Next, for the query selection strategy, we calculate the information density DS per instance. For identifying informative instances, we use the instances for which the classifier is most uncertain. As uncertainty measure, the entropy calculated for each instance x over the class labels y ∈ Y is employed:

  H(x) = −Σ_{y ∈ Y} P(y|x) log P(y|x)

Based on the information density and the entropy, the density×entropy measure DSH(x) = DS(x) × H(x) (Zhu et al., 2008) is calculated for each instance x. The informativeness and representativeness of each cluster c is then computed as the mean average of the DSH values of its instances:

  DSHC(c) = (Σ_{i ∈ c} DSH(i)) / |c|

For selecting the appropriate instances to query, the clusters are sorted by their DSHC. The number of instances to draw per cluster is calculated as n = log_ms(CS), where ms is the average size of all clusters and CS is the size of the current cluster. We decided to use a logarithmic scale with base ms to avoid drawing too many instances from larger clusters, as would be the case with a linear approach. We assume that drawing only small numbers of instances per cluster is sufficient, as at some point additional instances will not yield any additional information because the instances will be too similar to each other. Instances are selected until the number of instances to label per iteration is reached. Based on the previous and the new instances, the model is retrained. The whole process is repeated until all iterations are finished.
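A minimal sketch of one such iteration follows, reusing information_density() from the previous sketch. It assumes a classifier exposing class probabilities via predict_proba (e.g., a scikit-learn model), whereas the paper itself uses Weka's SMO; rounding of n and tie handling are illustrative choices.

```python
# A minimal sketch of one query-selection iteration (Algorithm 2), picking
# instances by the density*entropy measure DSH.
import numpy as np

def entropy(probs: np.ndarray) -> np.ndarray:
    """H(x) = -sum_y P(y|x) log P(y|x), computed row-wise."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def query_selection(model, clusters: list, b_iter: int) -> list:
    """Pick (cluster, instance) pairs ordered by per-cluster mean DSH,
    drawing n = log_ms(|c|) per cluster, where ms is the mean cluster size
    (assumed > 1 so the logarithm base is valid)."""
    ms = np.mean([len(c) for c in clusters])
    dsh = [information_density(c) * entropy(model.predict_proba(c))
           for c in clusters]
    order = sorted(range(len(clusters)), key=lambda ci: -dsh[ci].mean())
    picked = []
    for ci in order:
        n = max(1, int(round(np.log(len(clusters[ci])) / np.log(ms))))
        best = np.argsort(-dsh[ci])[:n]
        picked.extend((ci, int(i)) for i in best)
        if len(picked) >= b_iter:
            return picked[:b_iter]
    return picked
```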
6. Experiments

We conducted two experiments regarding incident type classification. First, we compared related approaches to show that event-based clustering outperforms other clustering-based active learning approaches. Additionally, the effect of the number of labeled instances on the classifier performance is examined. In the second experiment, the influence of noise is inspected. Noise results from a low number of non-expert labelers. For keeping costs low, a classifier should not be affected by a bad labeling.

[Figure 1 (plots not reproduced). Evaluation results of state-of-the-art selection strategies (Tang et al., Zhu et al., uncertainty sampling) and our event-based clustering: F1 score over the number of labeled instances (0-1,200) for (a) ground truth data, (b) one annotator, (c) five annotators, (d) 20 annotators, (e) 50 annotators, and (f) 200 annotators (regular crowd users). Note that the more annotators labeled an instance, the lower is the probability of a noisy labeling.]

6.1. Classification and Metrics

The active learning algorithms select instances from the training set to query for labels. Based on these, a classifier was trained and evaluated on the test set. As classifier we used Weka's implementation of John Platt's sequential minimal optimization (SMO) algorithm for training a support vector machine (Platt, 1998). Due to the complexity of determining the best parameters for each iteration and each approach, we followed related approaches (see (Huang et al., 2010) and (Donmez et al., 2007)) and decided to compare all algorithms with fixed parameters. Consequently, the SVM was used with standard settings.

For comparison, the deficiency metric (Raghavan et al., 2006) is calculated using the achieved F1 scores of all iterations of a reference baseline algorithm (REF) and the compared active learning approach (AL). The result is normalized using the largest F1 score and the learning curve of the reference algorithm REF. Thus, the measure is non-negative, and values smaller than 1 indicate a more efficient algorithm compared to the baseline strategy, whereas a value larger than 1 indicates a performance decrease compared to the baseline strategy.
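Since the deficiency metric is only described verbally above, here is a minimal sketch of one common formulation from Raghavan et al. (2006), under the assumption that both learning curves are evaluated at the same iteration points.

```python
# A minimal sketch of the deficiency metric in one common formulation
# (Raghavan et al., 2006); both F1 curves must share the same iteration grid.
import numpy as np

def deficiency(f1_al: np.ndarray, f1_ref: np.ndarray) -> float:
    """< 1: AL is more efficient than the reference strategy; > 1: worse."""
    f1_max = f1_ref.max()
    return float((f1_max - f1_al).sum() / (f1_max - f1_ref).sum())

# Usage: deficiency(curve_event_based, curve_tang) against the Tang et al.
# baseline yields values like those reported in Table 2.
```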
6.2. Algorithms and Parameters

In order to evaluate the performance of our approach, we re-implemented the following related approaches:

(Tang et al., 2002): For initial sampling, a k-means clustering is used. For query selection, first the most uncertain instances of each cluster are selected. Then, information density is used to weight the examples. We set k = 4, because we have four different event types.

(Zhu et al., 2008): For initial sampling, a k-means clustering is used (k = 4). During the iterations, the entropy×density measure is used as selection criterion and no clustering is applied.

Uncertainty: Random instances (initial) and entropy-based uncertainty sampling (iterations) are used.

Event-based clustering: Our event-based clustering is applied with a spatial extent of 200m and a temporal extent of 20min; both values are a result of discussions with emergency managers. As a result, the 1,200 tweets of the training set are divided into 438 distinct event clusters.

Following the experimental settings of (Hu et al., 2013) and (Huang et al., 2010), we set the size of the initial training set and the size during the iterations to 50. No further tuning or parameterization was applied. Each iteration of each algorithm was repeated 10 times, as, for instance, the uncertainty approach is highly dependent on the selected instances. We used the averaged F1 score over these repetitions.

6.3. Comparison to State-of-the-Art Approaches

The performance graph for the ground truth data is shown in Figure 1 (a). Note that the x-axis shows the total number of instances, combining the 50 instances of the initial training set and 50 instances drawn per iteration. Thus, iterations from 0 to 23 are depicted. As shown, the performance after selecting the initial training set is superior with our approach. Also, in regions where only a few instances were labeled, the event-based clustering has a higher F1 value. This shows that a high-quality selection of the iteration instances is possible with our method.

Table 2 shows the deficiencies.

Table 2. Deficiencies with Tang et al. as the baseline strategy.

  Approach                  Deficiency
  (Tang et al., 2002)       1.00
  Uncertainty Sampling      0.53
  (Zhu et al., 2008)        0.90
  Event-based Clustering    0.44

With respect to the performance of the iterations, our approach has a decreased deficiency compared to other clustering approaches (0.44 vs. 0.53). The approach of Zhu et al. outperforms the approach of Tang et al. in most iterations and also with respect to the deficiency. We attribute this to the improved strategy for query selection. A surprising result is the performance of uncertainty sampling, which outperforms the other two clustering strategies. Apparently, focusing only on informativeness seems to be a good strategy for our dataset. In contrast, using the number of distinct events as the number of clusters might not be the most efficient approach.

The graph also shows that our approach has a steep learning curve: for instance, only a sixth of all instances is needed to achieve an F1 score of about 84%. This is especially important when it comes to labeling costs, as only a limited amount of data would need to be labeled. One can see a drop at 500 instances. This is most likely because with more instances the number of clusters is decreasing; thus, selecting appropriate instances is more difficult.

We can conclude that event-based clustering that takes representative as well as informative instances into account is a promising strategy for active learning. We also showed that our approach outperforms the state-of-the-art for selecting an initial training set and for choosing appropriate instances for labeling in each iteration.

Influence of noise in the labels: In Figure 1 (b) and 1 (c), the learning curves for the very error-prone cases with one and five annotators, respectively, are shown. As can be seen in the curve of the approach of Tang et al., the influence of noise is notable in the big drop at 500 instances. Also, Zhu et al.'s approach has a much lower initial F1 score compared to all others, which is an indicator of an inappropriate initial selection strategy. The results indicate that even with noisy labels, our approach outperforms the state-of-the-art, as the situation in the graphs in the lower part of Figure 1 does not change much. In all these cases, the learning curves are quite similar, which is a result of the decreased number of wrongly labeled instances. Clearly, the performance of all approaches increases with a lower number of errors.

As we showed, our approach outperforms related work also if noise is taken into account. Not surprisingly, we found that with an increasing number of annotators, noise becomes negligible. With only one annotator, the deficiency is worse by 57%, and with five annotators still worse by 26%. Even with 50 annotators, the deficiency is still worse by 10%. For more than ten annotators, an F1 score of 85% is reached comparably fast. With a maximum of five annotators, this level is only reached at the end of the simulation. For one annotator, this maximum is never achieved. These results indicate that a minimum number of annotators is needed for achieving good results with crowdsourced labeling tasks. In our experiments, ten annotators seem to be sufficient, while in other domains with different error rates, there might be a need for many more annotators.
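The exact simulation protocol for annotator noise is not spelled out above; the following sketch shows one straightforward reading, which is an assumption on our part: each simulated annotator independently flips a label to a random wrong class with the empirically measured random-error rate, and the final label is a majority vote.

```python
# A hypothetical sketch of the annotator-noise simulation: flip each label
# with the measured random-error rate (0.0331), then majority-vote across
# n simulated annotators. The protocol is an assumed reading, not the
# paper's documented setup.
import random

CLASSES = ["car crash", "fire", "shooting", "no incident"]
RANDOM_ERROR = 0.0331  # combined rate measured in the labeling study

def noisy_label(true_label: str, rng: random.Random) -> str:
    """Return the true label, or a random wrong class with prob. RE."""
    if rng.random() < RANDOM_ERROR:
        return rng.choice([c for c in CLASSES if c != true_label])
    return true_label

def crowd_label(true_label: str, n_annotators: int, seed: int = 0) -> str:
    """Majority vote over n independently noisy annotators (ties arbitrary)."""
    rng = random.Random(seed)
    votes = [noisy_label(true_label, rng) for _ in range(n_annotators)]
    return max(set(votes), key=votes.count)
```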
Conclusion query selection. A surprising result is the performance of We presented an event-based clustering strategy for event uncertainty sampling that outperforms the other two clus- type classification of microposts and coped with several tering strategies. Apparently, only focusing on the infor- problems of active learning in the emergency management mativeness seems to be a good strategy for our dataset. In domain. First, it was shown that domain experts do not dif- contrast, using the number of distinct events as the number fer significantly from regular crowd users when it comes of clusters might not be the most efficient approach. to labeling quality. Second, we presented a novel selection The graph also shows that our approach has a steep learn- strategy for active learning based on temporal, spatial, and ing curve as for instance only a sixth of all instances are thematic information. Our event-based clustering that iden- needed to achieve a F1 score of about 84%. This is espe- tifies representative as well as informative instances outper- cially important when it comes to labeling costs, as only a forms state-of-the-art clustering approaches. On incident- limited amount of data would need to be labeled. One can related microposts we showed that a better initial training see a drop at 500 instances. This is most likely because set is selected as well as to appropriate instances for la- with more instances the number of clusters is decreasing, beling in each iteration are chosen. The learning curve thus, selecting appropriate instances is more difficult. indicated that only a sixth of all instances are needed to achieve a F1 score of about 84%, which is especially im- We can conclude that event-based clustering that takes rep- portant when it comes to labeling costs, as only a limited resentative as well as informative instances into account is amount of data would need to be labeled to achieve good a promising strategy for active learning. We also showed classification results. that our approach outperforms state-of-the-art for selecting an initial training set and for choosing appropriate instances In the future, we aim at using our active learning frame- for labeling in each iteration. work in addition to labeling of single features. Further- more, though our framework follows a general approach, Influence of noise in the labels In Figure 1 (b) and 1 (c), we only evaluated it on incident-related data, thus, we also the learning curves for the very error-prone cases with one want so show the applicability on other types of events. 8R Event-based Clustering for Reducing Labeling Costs of Incident-Related Microposts References Schulz, Axel, Ristoski, Petar, and Paulheim, Heiko. I see a car crash: Real-time detection of small scale incidents Donmez, Pinar and Carbonell, Jaime G. Proactive learning: in microblogs. In The Semantic Web: ESWC 2013 Satel- cost-sensitive active learning with multiple imperfect or- lite Events, volume 7955 of Lecture Notes in Computer acles. In CIKM’08, pp. 619–628, 2008. Science, pp. 22–33. Springer Berlin Heidelberg, 2013b. Donmez, Pinar, Carbonell, Jaime G., and Bennett, Paul N. Schulz, Axel, Guckelsberger, Christian, and Janssen, Fred- Dual strategy active learning. In ECML’07, pp. 116–127, erik. Semantic abstraction for generalization of tweet 2007. classification: An evaluation on incident-related tweets. In Semantic Web Journal: Special Issue on The Role of Hoi, Steven C. H., Jin, Rong, and Lyu, Michael R. 
Large- Semantics in Smart Cities (to appear). IOS Press, 2015. scale text categorization by batch mode active learning. In WWW’06, pp. 633–642, 2006. Settles, Burr. Active Learning. Synthesis Lectures on Ar- tificial Intelligence and Machine Learning. Morgan & Hsueh, Pei-Yun, Melville, Prem, and Sindhwani, Vikas. Claypool Publishers, 2012. Data quality from crowdsourcing: A study of annotation selection criteria. In NAACL HLT’09, pp. 27–35, 2009. Shen, Dan, Zhang, Jie, Su, Jian, Zhou, Guodong, and Tan, Chew-Lim. Multi-criteria-based active learning for Hu, Xia, Tang, Jiliang, Gao, Huiji, and Liu, Huan. Actnet: named entity recognition. In ACL’04, 2004. Active learning for networked texts in microblogging. In SIAM’13, 2013. Tang, Min, Luo, Xiaoqiang, and Roukos, Salim. Ac- tive learning for statistical natural language parsing. In Huang, Sheng-Jun, Jin, Rong, and Zhou, Zhi-Hua. Active ACL’02, pp. 120–127, 2002. learning by querying informative and representative ex- amples. In NIPS, pp. 892–900, 2010. Thongsuk, Chanattha, Haruechaiyasak, Choochart, and Meesad, Phayung. Classifying business types from twit- Kang, Jaeho, Ryu, Kwang R., and Kwon, Hyuk C. Using ter posts using active learning. In I2CS’10, pp. 180–189, Cluster-Based Sampling to Select Initial Training Set for 2010. Active Learning in Text Classification. In PAKDD’04, Tong, Simon and Koller, Daphne. Support vector machine pp. 384–388, 2004. active learning with applications to text classification. J. Lewis, David D. and Catlett, Jason. Heterogeneous uncer- Mach. Learn. Res., 2:45–66, 2002. tainty sampling for supervised learning. In ICML’94, pp. Wallace, Byron C., Small, Kevin, Brodley, Carla E., and 148–156, 1994. Trikalinos, Thomas A. Who should label what? instance Nguyen, Hieu T. and Smeulders, Arnold. Active learning allocation in multiple expert active learning. In SDM’11, using pre-clustering. In ICML’04, pp. 79–86, 2004. 2011. Zhao, Liyue, Sukthankar, G., and Sukthankar, R. In- Platt, J. Fast training of support vector machines using cremental Relabeling for Active Learning with Noisy sequential minimal optimization. In Schoelkopf, B., Crowdsourced Annotations. In SocialCom’11, pp. 728– Burges, C., and Smola, A. (eds.), Advances in Kernel 733, 2011. Methods - Support Vector Learning. MIT Press, 1998. Zhu, Jingbo, Wang, Huizhen, Yao, Tianshun, and Tsou, Raghavan, Hema et al. Active learning with feedback on Benjamin K. Active learning with sampling by uncer- both features and instances. J. of Machine Learning Re- tainty and density for word sense disambiguation and search, 7:1655–1686, 2006. text classification. In COLING’08, pp. 1137–1144, 2008. Schulz, Axel. Mining User-Generated Content for Inci- dents. PhD thesis, TU Darmstadt, 2014. URL http: //tuprints.ulb.tu-darmstadt.de/4107/. Schulz, Axel, Hadjakos, Aristotelis, Paulheim, Heiko, Nachtwey, Johannes, and Mühlhäuser, Max. A multi- indicator approach for geolocalization of tweets. In Pro- ceedings of the Eight International Conference on We- blogs and Social Media (ICWSM), pp. 573–582, Menlo Park, California, USA, 2013a. AAAI Press. 8k