=Paper=
{{Paper
|id=Vol-1141/paper_01
|storemode=property
|title=Evaluating Multi-label Classification of Incident-related Tweets
|pdfUrl=https://ceur-ws.org/Vol-1141/paper_01.pdf
|volume=Vol-1141
|dblpUrl=https://dblp.org/rec/conf/msm/SchulzMDS14
}}
==Evaluating Multi-label Classification of Incident-related Tweets==
Evaluating Multi-label Classification of Incident-related Tweets

Axel Schulz, Eneldo Loza Mencía, Thanh Tung Dang, Benedikt Schmidt

Telecooperation Lab, Technische Universität Darmstadt, Germany · Knowledge Engineering Group, Technische Universität Darmstadt, Germany · HCI Research, SAP AG, Darmstadt, Germany

{aschulz,benedikt.schmidt}@tk.informatik.tu-darmstadt.de · eneldo@ke.tu-darmstadt.de · thanh.tung.dang@sap.com

ABSTRACT

Microblogs are an important source of information in emergency management, as lots of situational information is shared, both by citizens and official sources. It has been shown that incident-related information can be identified in the huge amount of available information using machine learning. Nevertheless, the currently used classification techniques only assign a single label to a micropost, resulting in a loss of important information that would be valuable for crisis management. With this paper we contribute the first in-depth analysis of multi-label classification of incident-related tweets. We present an approach assigning multiple labels to these messages, providing additional information about the situation at hand. An evaluation shows that multi-label classification is applicable for detecting multiple labels with an exact match of 84.35%. Thus, it is a valuable means for classifying incident-related tweets. Furthermore, we show that correlation between labels can be taken into account for these kinds of classification tasks.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering; I.2.6 [Artificial Intelligence]: Learning

General Terms

Algorithms, Experimentation

Keywords

Microblogs, Multi-label Learning, Social Media

Copyright © 2014 held by author(s)/owner(s); copying permitted only for private and academic purposes. Published as part of the #Microposts2014 Workshop proceedings, available online as CEUR Vol-1141 (http://ceur-ws.org/Vol-1141). #Microposts2014, April 7th, 2014, Seoul, Korea.

1. INTRODUCTION

Social media platforms are widely used by citizens for sharing information covering personal opinions about various topics (e.g., politics) as well as information about events such as incidents. In the latter case, citizens act as observers and create valuable incident-related information. For instance, during incidents such as the Oklahoma grass fires and the Red River floods in April 2009 [29], or the terrorist attacks on Mumbai [4], useful situational information was shared on Twitter. Also, Ushahidi, a social platform used for crowd-based filtering of information [15], was heavily used during the Haitian earthquake for labeling crisis-related information.

However, the discovery of incident-related information is a complex task, requiring the separation of valuable information from daily chatter in the vast amount of information created on social platforms. This can be realized based on techniques from data mining and machine learning. Classification is one method which can be utilized to extract relevant information from social networks (for tweets, see [23]). In a classification task, a system learns to label messages with exactly one label out of a predefined label set (e.g., "fire" or "crash"). This task is known as multi-class classification and is widely used for text classification. However, during our research we found that assigning only one label would result in the loss of important situational information for decision making in crisis management. For instance, consider the following tweet:

THIS CAR HIT THE FIRE HYDRANT AND CAUGHT FIRE....SOMEONE HOLIDAY ALTERED

A single label would necessarily lack relevant information. A better approach is the concurrent assignment of all three labels, which is known as multi-label learning. In the example, all labels ("fire", "crash", and "injuries") would be assigned concurrently using an appropriate learning algorithm.
The example also shows that the assignment of multiple labels is not necessarily an independent process. Once the label for an incident type such as "crash" is assigned, the probability of assigning the label "injuries" changes. This dependency is known as label correlation and needs to be investigated in the context of multi-label learning.

With our analysis we want to investigate three important aspects of applying multi-label learning on incident-related tweets: (1) how to apply multi-label learners on tweets, (2) whether the classification accuracy of multi-label classification approaches is comparable to the accuracy of multi-class classification approaches, and (3) whether correlation between labels is a factor that needs to be taken into account for incident-related information. With this paper we contribute the first in-depth analysis of multi-label classification of incident-related tweets. In summary, our contributions are twofold:

• We show that multi-label classification on incident-related tweets is applicable and able to detect the exact combinations of labels in 84.35% of the cases. Thus, we show that compared to common multi-class classification approaches, multi-label classification of incident-related tweets is a valuable means.

• We evaluate the influence of label correlation on the classification results of incident-related tweets. We show that for these classification tasks label correlation needs to be taken into account.

The remainder of the paper is organized as follows. First, we describe and discuss related approaches. Second, the considered multi-label classification algorithms as well as the technical infrastructure (a machine learning pipeline) used for the analysis are presented. Next, we introduce our data collection setup and describe the evaluation of our approach. We close with a conclusion and future work.

2. RELATED WORK

Techniques of multi-label classification have been applied to domains such as text categorization [21, 13], music genre detection [20], or tag recommendation [7]. These application domains address long texts, images, or audio information. Text is probably one of the oldest domains in which the demand for categorization appeared, particularly multi-label categorization [25], with the first multi-label dataset (Reuters-21578) used in machine learning research dating from the year 1987 [5, 8, 9]. Moreover, the data is easily accessible and processable as well as vastly available. Hence, text classification was also one of the first research fields for multi-label classification and continues to be the most represented one among the commonly available benchmark datasets (cf. the repositories at http://mulan.sourceforge.net/datasets.html [28] and http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html).

A common application for texts is the classification of news articles [10, 18], for which the research focuses on scalability issues regarding the number of articles and especially the number of labels a text can be assigned to, which can sometimes go up to the thousands [11, 26]. News texts, as well as abstracts from scientific papers [14] or radiology reports [16], may sometimes be relatively short, but they are usually still structured and homogeneous. This kind of multi-label text classification problem was very well analyzed in the past, and the used approaches proved to be effective (we refer the interested reader to the cited recent works).

In contrast, texts such as tweets are mostly unstructured and noisy because of their limitations in size and the often used colloquial language. Related work on such short texts with a focus on solving multi-class problems exists, e.g., for sentiment analysis [24] or incident detection and classification [23]. In contrast to these approaches, this paper focuses on the use of multi-label classification for tweets.

Applying multi-label learning on very short texts is a topic of open research. Only two respective examples are known to the authors: Sajnani et al. [19] and Daxenberger et al. [1]. Sajnani et al. provided a preliminary analysis of multi-label classification of Wikipedia barnstar texts. Barnstars can be awarded by Wikipedia authors and contain a short textual explanation why they have been awarded. In this case, labels for seven work domains have to be differentiated. The authors show which features can be extracted from short texts for multi-label classification and evaluate several multi-label classification approaches. Daxenberger et al. categorize individual Wikipedia edits into non-exclusive classes like vandalism, paraphrase, etc.

Summarized, although many related approaches cope with multi-class classification of short texts such as microblogs, multi-label classification is an open research issue. Especially for the domain of crisis management, no prior research on this topic exists.

3. MULTI-LABEL CLASSIFICATION

In this section, we give an overview of multi-label classification. Multi-label classification refers to the task of learning a function that maps instances $x_i = (x_{i,1}, \ldots, x_{i,a}) \in \mathcal{X} \subseteq \mathbb{R}^a$ to label subsets or label vectors $y_i = (y_{i,1}, \ldots, y_{i,n}) \in \{0,1\}^n$, where $L = \{\lambda_1, \ldots, \lambda_n\}$, $n = |L|$, is a finite set of predefined labels and where each label attribute $y_{i,j}$ corresponds to the absence (0) or presence (1) of label $\lambda_j$. Thus, in contrast to multi-class classification, the alternatives are not assumed to be mutually exclusive, such that multiple labels may be associated with a single instance.
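To make the label-vector notation concrete, here is a minimal sketch (ours, not the authors' code) that encodes the introduction's example tweet over the four labels used later in the paper; the helper name is illustrative only.

```python
# Minimal sketch: representing a tweet's assigned labels as a binary
# label vector y in {0,1}^n over the fixed label set L (n = 4).
import numpy as np

LABELS = ["Fire", "Shooting", "Crash", "Injury"]  # L, n = |L| = 4

def to_label_vector(assigned_labels):
    """Map a set of assigned labels to y in {0,1}^n."""
    return np.array([1 if l in assigned_labels else 0 for l in LABELS])

# The introduction's example tweet carries three labels at once:
y = to_label_vector({"Fire", "Crash", "Injury"})
print(y)  # [1 0 1 1] -- not expressible as a single multi-class label
```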
This makes multi-label data particularly interesting from the learning perspective, since, in contrast to binary or multi-class classification, there are label dependencies and interconnections in the data which can be detected and exploited in order to obtain additional useful information or simply better classification performance. Some examples for our particular Twitter dataset were already shown in the introduction. As we show, around 15% of our tweets could be assigned to more than one label. Thus, we believe that it is not unusual to encounter tweets with several possible labels, so that in our opinion the view of microblogs as multi-labeled data seems more natural, more realistic, and more general. Nonetheless, previous work focuses on the multi-class labeling of tweets, and this is the first work known to the authors which tries to exploit label dependencies on tweets.

In the following, we describe commonly used approaches for multi-label classification: Binary Relevance (BR), Label Powerset (LP), and Classifier Chains (CC). All described techniques are based on the decomposition or transformation of the original multi-label problem into single-label binary problems, as most multi-label techniques do [27]. An illustration of these techniques is presented in Figure 1. This has the advantage that we can use state-of-the-art text classification algorithms, such as support vector machines [25, 6], for learning the binary problems. We will also have a closer look at each classification approach with respect to taking dependencies between labels into account. Two of the used approaches are specifically tailored to cope with such dependencies.

[Figure 1: Decomposition of multi-label training sets into multiclass (LP) or binary (BR, CC) problems: (a) the input training set with instances $x_i$ and label vectors $(y_{i,1}, \ldots, y_{i,n}) \in \{0,1\}^n$; (b) the label powerset (LP) decomposition into meta-classes $\sigma(y_i) \in \{1, \ldots, 2^n\}$; (c) the binary relevance (BR) decomposition into $n$ separate binary problems; (d) the classifier chains (CC) decomposition with augmented instances $x'_i \in \mathbb{R}^a \times \{0,1\}^{n-1}$. During prediction, $y_{i,1}, y_{i,2}, \ldots$ in the extended input space is replaced by the predictions of $h^{CC}_1, h^{CC}_2, \ldots$ (see text).]

3.1 Binary Relevance

The most common approach for multi-label classification is to use an ensemble of binary classifiers, where each classifier predicts whether an instance belongs to one specific class or not. The union of all predicted classes is taken as the multi-label output. This approach is comparable to classical one-against-all for a multi-class problem. Formally, we convert a training example pair $(x_i, y_i)$ into $n$ separate pairs $(x_i, y_{i,j})$, $j = 1 \ldots n$, one for each of the $n$ base classifiers $h_j$. The predicted labels $\hat{y}_j$ for a test instance $x$ are then the result of $h_j(x) \in \{0,1\}$.

This method is fast and simple; however, it is not able to take label dependencies into account, since each base classifier is trained independently of the other classifiers. As was recently stated by Dembczynski et al. [2], this is not necessarily a disadvantage if the objective is to obtain good label-wise predictions, as measured, e.g., by the Hamming loss (cf. Section 5). Therefore, BR serves as a fairly well performing baseline for our experiments.
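As an illustration of the BR decomposition, the following sketch trains one linear SVM per label. It is not the paper's Mulan/Weka setup; scikit-learn's LinearSVC is used here because, like the authors' base learner [3], it builds on the LIBLINEAR library. A precomputed feature matrix X (m x a) and binary label matrix Y (m x n) are assumed.

```python
# Illustrative Binary Relevance (BR): one independent binary LIBLINEAR
# SVM per label; the union of per-label predictions forms the output.
# Assumes every label occurs at least once in the training data.
import numpy as np
from sklearn.svm import LinearSVC

class BinaryRelevance:
    def fit(self, X, Y):
        # one binary problem (x_i, y_ij) per label j
        self.models = [LinearSVC().fit(X, Y[:, j]) for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        # stack the per-label predictions into label vectors
        return np.column_stack([m.predict(X) for m in self.models])
```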
3.2 Label Powerset

The basic idea of this algorithm is to transform the multi-label problem into a multi-class classification problem by considering each member of the powerset of labels in the training set as a single class. Hence, each training example is converted into $(x_i, \sigma(y_i))$, with $\sigma, \sigma^{-1}$ denoting a bijective function that maps between the label powerset of $L$ and a set of $2^n$ meta-classes. The classifier $h^{LP}$ is trained, e.g., with one-against-all (as in our setting), and the prediction for $x$ is obtained with $\sigma^{-1}(h^{LP}(x))$.

LP takes label dependencies into account to some extent, as each distinct occurrence of a label pattern is treated as a new class. It is hence able to model the joint label distribution, but not explicitly and directly specific dependencies (correlations, implications, etc.) between labels. As a consequence, LP is tailored towards predicting exactly the correct label combination. As pointed out in [2], and contrary to what one may believe at first, this usually stands in contrast to predicting each label correctly individually (BR), i.e., we usually have a trade-off between both objectives.

In addition to the obvious computational cost problem due to the exponential growth of meta-labels, the sparsity of some label combinations, especially with an increasing number of labels, often causes some classes to contain only few examples. This effect can also be observed in our data, cf. Table 2.
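A corresponding sketch of LP under the same assumptions; the bijection $\sigma$ is realized as a lookup table over the label combinations observed during training, which also makes explicit LP's inability to predict unseen combinations.

```python
# Illustrative Label Powerset (LP): sigma maps each observed label vector
# to one meta-class; a single multi-class LIBLINEAR model is trained on
# these meta-classes (LinearSVC uses one-against-all by default).
import numpy as np
from sklearn.svm import LinearSVC

class LabelPowerset:
    def fit(self, X, Y):
        combos = [tuple(row) for row in Y]
        self.classes = sorted(set(combos))             # observed subsets of L
        sigma = {c: k for k, c in enumerate(self.classes)}
        self.model = LinearSVC().fit(X, [sigma[c] for c in combos])
        return self

    def predict(self, X):
        # sigma^-1: map predicted meta-classes back to label vectors;
        # LP can only ever output combinations seen during training
        return np.array([self.classes[k] for k in self.model.predict(X)])
```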
3.3 Classifier Chains

As stated before in Section 1, it is very likely in our dataset that injured people are mentioned when an incident type is also mentioned (200 of 967 cases). On the other hand, it seems almost a matter of course that there was an incident if there is an injured person. Although this only happens in 200 out of 232 cases in our data, we consider it relevant for larger data sets. The classifier chains approach (CC) of Read et al. [17] is able to directly capture such dependencies and has therefore become very popular recently.

The idea of this approach is to construct a chain of $n$ binary classifiers $h^{CC}_j$, for which (in contrast to BR) each binary base classifier $h^{CC}_j$ depends on the predictions of the previous classifiers $h^{CC}_1, \ldots, h^{CC}_{j-1}$. Particularly, we extend the feature space of the training instances for the base classifier $h^{CC}_j$ to $((x_{i,1}, \ldots, x_{i,a}, y_{i,1}, \ldots, y_{i,j-1}),\ y_{i,j})$. Since the true labels $y_i$ are not known during prediction, CC uses the predictions of the preceding base classifiers instead. Hence, the unknown $y_j$ are replaced by the predictions $\hat{y}_j = h^{CC}_j(x, \hat{y}_1, \ldots, \hat{y}_{j-1})$.

This points to one problematic aspect of the approach, namely the order of the classifiers in the chain. Depending on the ordering, CC can only capture one direction of dependency between two labels. More specifically, CC can only capture the dependencies of $y_i$ on $y_1, \ldots, y_{i-1}$, but there is no possibility to consider dependencies of $y_i$ on $y_{i+1}, \ldots, y_n$. Returning to our example from the beginning, we can either learn the dependency of the label incident given injury or the other way around, but not both. In addition, the effect of error propagation caused by the chaining structure may also depend on the label permutation. We will evaluate the effect of choosing different orderings for our particular dataset later on in Section 5.3.

Furthermore, CC has advantages compared to LP. CC, like LP, is geared towards predicting the correct label set [2], but unlike LP, CC is able to predict label combinations which were not seen beforehand in the training data. In addition, the imbalance between positive and negative training examples is generally lower than for LP.
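The chaining idea can be sketched in a few lines (again illustrative, not the authors' implementation); note how training augments the feature space with the true labels, while prediction substitutes the estimates $\hat{y}_j$.

```python
# Illustrative Classifier Chain (CC): classifier j sees the original
# features plus labels y_1..y_{j-1} (true labels at training time,
# predictions at test time). `order` is a permutation of label indices.
import numpy as np
from sklearn.svm import LinearSVC

class ClassifierChain:
    def __init__(self, order):
        self.order = order

    def fit(self, X, Y):
        self.models = []
        Z = X
        for j in self.order:
            self.models.append(LinearSVC().fit(Z, Y[:, j]))
            Z = np.column_stack([Z, Y[:, j]])   # augment with true label y_j
        return self

    def predict(self, X):
        Yhat = np.zeros((X.shape[0], len(self.order)), dtype=int)
        Z = X
        for model, j in zip(self.models, self.order):
            Yhat[:, j] = model.predict(Z)
            Z = np.column_stack([Z, Yhat[:, j]])  # augment with prediction
        return Yhat
```

For instance, with label columns ordered (Fire, Shooting, Crash, Injury), `ClassifierChain(order=(1, 2, 0, 3))` would correspond to the sequence Shooting, Crash, Fire, Injury that turns out best in Section 5.3.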
4. MULTI-LABEL CLASSIFICATION OF INCIDENT-RELATED TWEETS

In the following, the data used for multi-label classification of incident-related tweets is described in detail. The taken approach is composed of three steps. As a first step, unstructured text has to be converted into structured text. As a second step, the structured information needs to be transformed into features that can be used by a multi-label learner. Third, these features are used to train and evaluate a classifier.

4.1 Preprocessing of Unstructured Text

Our overall goal is to apply text mining on short documents that are present in social media; thus, they need to be represented by a set of features. As texts in social media are mostly unstructured, they first need to be converted into a representation which enables feature generation. Hence, as a first step, we apply Natural Language Processing. Firstly, we remove all re-tweets, as these are just duplicates of other tweets and do not provide additional information. Secondly, @-mentions of Twitter users are removed from the tweet message, as we want to prevent overfitting towards certain user tokens. Before further processing is applied, the text is converted to Unicode, as some tweets contain non-Unicode characters. Third, abbreviations are resolved using a dictionary of abbreviations based on the data provided by the Internet Slang Dictionary & Translator (http://www.noslang.com). Then, we identify and replace URLs with a common token "URL". As a next step, stopwords are removed. This is important, as very frequent words have limited influence when it comes to classifying tweets due to their relative frequency. Based on the resulting text, we conduct tokenization: the text is divided into discrete words (tokens) based on different delimiters such as white spaces. Every token is then analyzed, and non-alphanumeric characters are removed or replaced. Also, lemmatization is applied to normalize all tokens. In addition to the common NLP processing steps, we identify and replace location mentions such as "Seattle" with a common token to allow semantic abstraction. For this, we use the approach presented in [23] to detect named entities referring to locations (so-called location mentions) in tweets and to replace them with the two tokens "LOC" and "PLACE".
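The following rough sketch illustrates a subset of these steps (URL replacement, @-mention removal, tokenization, stopword removal). The abbreviation dictionary, lemmatization, and the location tagger of [23] are external components that are not reproduced here, and the stopword list is a tiny stand-in.

```python
# Rough sketch of a few of the preprocessing steps described above.
import re

STOPWORDS = {"the", "a", "an", "and", "is", "on", "at", "this"}  # stand-in list

def preprocess(tweet):
    text = re.sub(r"https?://\S+", "URL", tweet)     # replace URLs with a token
    text = re.sub(r"@\w+", "", text)                 # drop @-mentions
    tokens = re.findall(r"[A-Za-z0-9']+", text)      # simple tokenization
    # lowercase everything (the "URL" token becomes "url") and drop stopwords
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

print(preprocess("THIS CAR HIT THE FIRE HYDRANT @user http://t.co/x"))
# ['car', 'hit', 'fire', 'hydrant', 'url']
```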
4.2 Feature Generation

After finishing the initial preprocessing steps, we extracted several features from the tweets that are used for training a classifier. We conducted a comprehensive feature selection, analyzing the value of each feature for the overall classification performance. We compared word n-grams, character n-grams, and TF-IDF [12] scores as well as syntactic features such as the number of exclamation marks, question marks, and upper case characters. We found that the following features are the most beneficial for our classification problems (a sketch of the extraction follows the list):

• Word 3-gram extraction: We extract word three-grams from the tweet message. Each 3-gram is represented by two attributes: one attribute indicating the presence of the 3-gram and another attribute indicating the frequency of the 3-gram.

• Sum of TF-IDF scores: For every document we calculate the accumulated TF-IDF (term-frequency inverse-document-frequency) score [12] based on the single TF-IDF scores of each term in the document. The rationale behind this is to create a similarity score which is not as strict as traditional TF-IDF scores, but allows forming clusters of similar documents.

• Syntactic features: Along with the features directly extracted from a tweet, several syntactic features are expected to improve the performance of our approach. People might tend to use a lot of punctuation, such as exclamation marks and question marks, or a lot of capitalized letters when they are reporting some incident. Hence, we extract the following features: the number of '!' and '?' in a tweet and the number of capitalized characters.

• Spatial features: As location mentions are replaced with a corresponding token, they appear as word unigrams in our model and can therefore be regarded as additional features.
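A sketch of how these three textual feature groups could be assembled. This is our illustration built on scikit-learn for brevity; the paper's actual pipeline is Weka/Mulan-based, and the function and parameter choices here are assumptions.

```python
# Sketch of the feature groups described above: word 3-grams
# (presence + frequency), an accumulated TF-IDF score per tweet,
# and simple syntactic counts.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def build_features(tweets):
    tri = CountVectorizer(ngram_range=(3, 3))       # word 3-gram frequencies
    freq = tri.fit_transform(tweets).toarray()
    presence = (freq > 0).astype(int)               # 3-gram presence flags

    tfidf = TfidfVectorizer().fit_transform(tweets)
    tfidf_sum = np.asarray(tfidf.sum(axis=1))       # accumulated TF-IDF per tweet

    syntactic = np.array([[t.count("!"), t.count("?"),
                           sum(c.isupper() for c in t)] for t in tweets])
    return np.hstack([presence, freq, tfidf_sum, syntactic])
```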
4.3 Dataset

We focus on three different incident types throughout the paper in order to differentiate incident-related tweets. These three classes have been chosen because we identified them as the most common incident types using the Seattle Real Time Fire Calls dataset (http://data.seattle.gov), which is a frequently updated source of official incident information. We also included injury as an additional label. This results in four labels consisting of very common and distinct incident types and the injury label: Fire, Shooting, Crash, and Injury.

We collected public tweets in English language using the Twitter Search API, which provides geotagged tweets as well as tweets for which Twitter inferred a geolocation based on the user profile. For the collection, we used a 15km radius around the city centers of Seattle, WA and Memphis, TN. We focused on only two cities, as for our analyses we are interested in the stream of tweets for these cities and a specific time period instead of a scattered sample of the world, which could be retrieved using the Twitter Streaming API. This gave us a set of 7.5M tweets collected from 11/19/12 to 02/07/13. Though we know about the limitations of the Search API, we think that we collected a relevant sample for our experiments.

The dataset was further reduced to be usable for high-quality labeling as well as the machine learning experiment. We first identified and extracted tweets mentioning incident-related keywords. Compared to other approaches that completely rely on filtering using hashtags, we take the whole message into account for identifying incident-related keywords. We retrieved a set of different incident types using the "Seattle Real Time Fire 911 Calls" dataset and defined one general keyword set with keywords that are used in all types of incidents, like 'incident', 'injury', 'police', etc. For each incident type, we further identified specific keywords. For instance, for the incident type 'Motor Vehicle Accident Freeway' we use the keywords 'vehicle', 'accident', and 'road'. Based on these words, we use WordNet (http://wordnet.princeton.edu) to extend this set by adding the direct hyponyms. For instance, the keyword 'accident' was extended with 'collision', 'crash', 'wreck', 'injury', 'fatal accident', and 'casualty'.
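The direct-hyponym expansion can be sketched with NLTK's WordNet interface. This is our illustration: the paper does not specify its WordNet tooling, and the NLTK 'wordnet' corpus must be downloaded beforehand.

```python
# Sketch of the WordNet-based keyword expansion: for each seed keyword,
# add the lemma names of its direct hyponyms (requires `nltk` with the
# 'wordnet' corpus installed, e.g. via nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def expand_keyword(word):
    expanded = {word}
    for synset in wn.synsets(word, pos=wn.NOUN):
        for hyp in synset.hyponyms():                 # direct hyponyms only
            expanded.update(l.name().replace("_", " ") for l in hyp.lemmas())
    return expanded

# e.g. expand_keyword("accident") yields terms such as 'collision',
# 'crash', 'injury', ... (cf. the expansion described above)
```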
Based on these incident-related keywords, we filtered the datasets. Furthermore, we removed all re-tweets, as the originating tweets are also contained in our datasets and only these are needed for our experiments. Based on this filtered dataset, we randomly selected 20,000 tweets.

The selected tweets have been labeled manually by one researcher of our department. Out of these tweets, we randomly selected 2,000 tweets for further re-labeling for our multi-label classification problem. Those tweets were manually examined by five researchers using an online survey. To assign the final coding, we differentiated between two types of agreement (this decision rule is sketched below the list):

• if four out of five coders agree on one label, only this label is assigned;

• if less than four coders agree on one label, all labels which at least two coders assumed to be correct are assigned as possible labels and further verified in a group discussion.
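The same decision rule as code, as a sketch of our own formalization of the two bullets; the handling of the rare case where more than one label reaches a four-coder majority is an assumption.

```python
# Sketch of the coder-agreement rule: with five coders, a label backed by
# at least four is assigned alone; otherwise every label chosen by at
# least two coders becomes a candidate for group discussion.
from collections import Counter

def aggregate_codings(codings):
    """codings: list of five label sets, one per coder."""
    votes = Counter(label for c in codings for label in c)
    majority = [l for l, v in votes.items() if v >= 4]
    if len(majority) == 1:
        return set(majority), False    # final label, no discussion needed
    # candidates with >= 2 votes, to be verified in a group discussion
    return {l for l, v in votes.items() if v >= 2}, True

print(aggregate_codings([{"Crash"}, {"Crash", "Injury"}, {"Crash"},
                         {"Crash"}, {"Injury"}]))  # ({'Crash'}, False)
```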
The final labeled dataset consists of 10 different label combinations. The distribution of every combination is outlined in Table 2. The distribution indicates that around 15% (232) of all tweets in our dataset have been labeled with multiple labels. Another observation is that almost exactly 50% of the tweets do not have any label assigned, which is rather unusual compared to typically used and analyzed multi-label datasets (we refer to the repository at http://mulan.sourceforge.net/datasets.html for an overview of the statistics of commonly used benchmark datasets in multi-label classification). In addition, the label cardinality, i.e., the average number of labels assigned to an instance, is around 0.59, whereas common datasets have at least more than 1 assigned. On the other hand, this is mainly due to the low number of total labels, since the label density (the average percentage of labels which are true) is 15%, which is a relatively high value. From a multi-label learning perspective, this is an interesting property of this dataset, since it is not clear how commonly used techniques will behave under this circumstance. For example, many algorithms ignore instances without any label given.

Table 2: Distribution of the 10 label combinations occurring in the 2000 tweets of the dataset.

  Label Combination       Number of Tweets
  {}                                   971
  {Fire}                               313
  {Shooting}                           184
  {Crash}                              268
  {Injury}                              32
  {Crash, Fire}                          2
  {Injury, Crash}                       47
  {Injury, Shooting}                   149
  {Injury, Fire}                        33
  {Injury, Fire, Crash}                  1

5. EVALUATION

In the following section, we provide the evaluation results for the presented multi-label classification approaches on our dataset. We also present the results for a keyword-based approach as a simple way of conducting multi-label classification.

5.1 Evaluation Setup

We performed our experiments with Mulan, an open-source library for multi-label classification based on Weka [28]. We used two learners for our evaluation. First, we use the LibLinear implementation of support vector machines with linear kernel [3] as our base learner. We use the default settings, as we found that additional parameter optimization was not beneficial for improving the overall classification results. Second, we used the Weka implementation of Naive Bayes. The results were obtained using 10-fold cross validation.

The evaluation of multi-label problems requires different measures compared to those used for multi-class problems. In our paper, we use the following metrics:

Exact Match: Exact match is the percentage of the $m$ test instances for which the label sets were exactly correctly classified (with $[[z]]$ as the indicator function returning 1 if $z$ is true, otherwise 0):

$$\mathrm{ExactMatch}(h) = \frac{1}{m} \sum_{i=1}^{m} [[\,y_i = h(x_i)\,]] \qquad (1)$$

Hamming Loss: The instance-wise Hamming loss [22] is defined as the percentage of wrong or missed labels compared to the total number of labels in the dataset. It hence takes into account both that an incorrect label is predicted and that a relevant label is not predicted. As this is a loss function, the optimal value is zero.

Recall, Precision and F1: We use micro-averaged precision and recall measures to evaluate our results, i.e., we compute a two-class confusion matrix for each label ($y_i = 1$ vs. $y_i = 0$) and eventually aggregate the results by (component-wise) summing up all $n$ matrices into one global confusion matrix (cf. [27]). Recall and precision are computed based on this global matrix in the usual way; F1 denotes the unweighted harmonic mean between precision and recall. We also report recall, precision and F1 for each label using the label-wise confusion matrices.
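These measures can be sketched as follows for binary label matrices Y (true) and Yhat (predicted); the sketch mirrors Eq. (1), the label-wise error fraction, and the micro-averaging over one global confusion matrix.

```python
# Sketch of the evaluation measures: exact match (Eq. 1), instance-wise
# Hamming loss, and micro-averaged precision/recall/F1.
import numpy as np

def exact_match(Y, Yhat):
    return np.mean(np.all(Y == Yhat, axis=1))   # fraction of exact label sets

def hamming_loss(Y, Yhat):
    return np.mean(Y != Yhat)                   # fraction of wrong/missed labels

def micro_prf(Y, Yhat):
    tp = np.sum((Y == 1) & (Yhat == 1))         # summed over all n label matrices
    fp = np.sum((Y == 0) & (Yhat == 1))
    fn = np.sum((Y == 1) & (Yhat == 0))
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return prec, rec, 2 * prec * rec / (prec + rec)
```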
5.2 Results for Keyword-Based Filtering

As mentioned before, we use keyword-based pre-filtering for selecting an initial set of tweets that is suitable for labeling. A first and simple approach for detecting incident-related tweets is to use these keywords directly for classification. Table 1 shows the real-world incident types from the Seattle Real Time Fire Calls dataset and the corresponding number of extracted keywords. For the injury class, no specific type in the Seattle dataset could be found; thus, we extended the set with a manually created list of keywords and their direct hyponyms.

Table 1: Overview of real-world incident types used for extraction of incident-related keywords and the number of extracted keywords for the keyword-based classification approach.

  Class     Real-World Incident Types                                # of Keywords
  Fire      Fire In Building; Fire In Single Family Res;                  148
            Automatic Fire Alarm Resd; Auto Fire Alarm;
            Car Fire; Car Fire Freeway
  Shooting  Assault w/Weap; Assault w/Weapons Aid                          36
  Crash     Motor Vehicle Accident; Motor Vehicle Accident                 73
            Freeway; Medic Response Freeway
  Injury    - (manually created keyword list)                              23

The results for classifying each individual class are shown in Table 3. The results indicate that precision as well as recall are rather low. Only for the fire class could a high recall be achieved.

Table 3: Precision and recall for each individual label when applying keyword-based classification.

              Shooting   Fire     Crash    Injury
  Precision   31.59%     54.12%   15.04%   63.64%
  Recall      68.77%     95.99%   49.37%   37.40%

Furthermore, if the keywords are used for multi-label classification, a precision of 32.22% and a recall of 64.90% is achieved, which is a rather bad result. Exact match (28.45%) and Hamming loss (27.08%) are also bad. Thus, we conclude that with simple keyword-based filtering, multi-label classification cannot be done accurately.

5.3 Results for Multi-Label Classification

As a first step, we address the question whether correlation between labels is taken into account and is beneficial for the classification results. Thus, we evaluated all different label sequences using the classifier chains algorithm for our labels Fire (F), Shooting (S), Crash (C), and Injury (I). The values for exact match for each sequence are shown in Figure 2 (using SVM as our base learner).

[Figure 2: Percentages of exact matches for all label sequences; exact match ranges from about 82.6% to 84.6% across sequences.]

The results indicate that the label sequence has indeed an influence on the classification performance. In our case, we get a difference of 1% between the best sequence Shooting, Crash, Fire, Injury and the worst sequence Injury, Crash, Fire, Shooting. Also, we see that the Injury label is best used after incident labels have been classified; in the best cases it is even one of the last labels in the sequence. It is also remarkable that classifying Shooting as the first label, followed by either Crash or Fire, is always a good option. This can be explained on the one hand by the generally good individual prediction performance for Shooting (cf. Table 5), hence leading to low error propagation, and on the other hand by the resulting label dependencies once the Shooting label is known: for instance, we can see from Table 2 that we can safely exclude Crash or Fire if there was a Shooting. This shows that our initial assumption that correlation between labels needs to be taken into account is true.
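The ordering experiment can be sketched by exhaustively scoring all 4! = 24 permutations. This is illustrative only: the paper uses 10-fold cross-validation rather than the single train/test split assumed here, and the sketch reuses the ClassifierChain and exact_match sketches from above.

```python
# Sketch of the ordering experiment in Section 5.3: evaluate a classifier
# chain for every permutation of the four labels and record the
# exact-match score per sequence.
from itertools import permutations

def evaluate_orderings(X_train, Y_train, X_test, Y_test):
    scores = {}
    for order in permutations(range(Y_train.shape[1])):   # 4! = 24 sequences
        cc = ClassifierChain(order).fit(X_train, Y_train)
        scores[order] = exact_match(Y_test, cc.predict(X_test))
    return scores  # best/worst of these correspond to CC-MAX/CC-MIN in Table 4
```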
Based on the respective best (MAX) and worst (MIN) sequence, we compared CC to the other multi-label approaches with the two different base learners. Table 4 shows these evaluation results. The first observation is that Naive Bayes is not adequate for classifying tweets: though it achieves the best recall values using CC, this comes in exchange for very low results on the remaining metrics and approaches. We will therefore focus on the results obtained by applying LibLinear as base learner.

Table 4: Results for the different multi-label approaches and base learners obtained by cross-validation.

                          Naive Bayes                        SVM
               BR       LP       CC-MIN   CC-MAX   BR       LP       CC-MIN   CC-MAX
  Exact Match  59.60%   66.95%   71.15%   72.45%   83.85%   83.05%   83.25%   84.35%
  H-Loss       15.02%   14.08%   9.400%   9.175%   4.688%   5.313%   4.900%   4.588%
  F1           52.19%   55.37%   72.90%   73.61%   83.55%   81.53%   82.80%   84.02%
  Precision    52.40%   55.34%   66.84%   67.92%   93.61%   90.28%   92.75%   93.46%
  Recall       51.98%   55.39%   79.63%   80.35%   75.44%   74.35%   74.72%   76.47%

The results show that, if there is the opportunity of pre-optimizing the ordering of the labels, e.g., by performing a cross-validation on the training data, then classifier chains are able to slightly outperform the other approaches, most likely because the label correlation is valuable. This is also reflected in the good performance with respect to exact match, where even the worst CC outperforms LP, which is particularly tailored towards matching the exact label combination. Note also that LP is a common approach for circumventing the need for multi-label classification by creating meta-classes, as already mentioned in the introduction. However, this approach is consistently inferior to the compared approaches, which demonstrates the need for more advanced techniques in this particular use case.

We can also observe that improving the prediction of the exact label combinations may come at the expense of reducing the performance on label-wise measures, since the additional features used by CC generally lead to a higher potential deterioration (MIN) than potential improvement (MAX) for Hamming loss, recall, precision, and F1, whereas for exact match this is not as clear.

As a last evaluation step, we evaluated the accuracy of each approach for every individual label. This is important, as we want to understand how well a classifier performs for each label. Table 5 depicts the accuracy for individual labels using SVM with the best label order.

Table 5: Precision and recall for each individual label.

             BR (SVM)          LP (SVM)          CC (SVM)
             Prec.    Recall   Prec.    Recall   Prec.    Recall
  Shooting   95.7%    79.3%    92.0%    76.9%    95.7%    79.3%
  Fire       94.7%    82.0%    90.3%    83.0%    93.3%    83.7%
  Crash      90.8%    77.4%    88.0%    78.3%    90.9%    78.3%
  Injury     92.9%    59.5%    91.1%    54.6%    93.0%    61.0%

The results show that the precision for individual labels is high, with about 90% to 95% for each label, which is much better than the keyword-based classification. The differences between all approaches are small; thus, all approaches seem to be appropriate for classifying the individual labels. However, the recall drops significantly, depending on the label type. For instance, injuries often remain undetected. Here, classifier chains show the best results for precision and recall. Note that the results for BR and CC on Shooting are the same, since the first classifier in the CC ordering is trained exactly like the corresponding BR classifier (cf. also Figure 1). This also shows that along the chain, CC slightly trades the good precision of BR for improved recall.

5.4 Discussion

Though the results show the advantage of multi-label classification, we want to understand the limitations of our approach. Thus, we first created a confusion matrix for the classifier chains approach with the best label order (Table 6). The matrix shows that most misclassifications occur due to an assignment of instances to the "no incident" label combination {}. The other wrong classifications are mostly a result of not detecting the injury label or of predicting it wrongly.

Table 6: Confusion matrix. The rows indicate the true label combinations and the columns the predicted ones (F = Fire, C = Crash, I = Injury, S = Shooting).

          ∅    F    C   F,C    I   F,I  C,I  F,C,I    S   F,S  I,S
  ∅      924   16   24    0    0    0    0     0      3    0    4
  F       49  261    0    0    0    3    0     0      0    0    0
  C       54    0  213    1    0    0    0     0      0    0    0
  F,C      1    1    0    0    0    0    0     0      0    0    0
  I       16    0    1    0   11    0    0     0      1    0    3
  F,I      5   10    0    0    1   17    0     0      0    0    0
  C,I      8    0   12    0    3    0   23     0      0    0    1
  F,C,I    1    0    0    0    0    0    0     0      0    0    0
  S       33    4    0    0    1    0    0     0    142    0    4
  F,S      0    0    0    0    0    0    0     0      0    0    0
  I,S     26    0    0    0    5    0    0     0     22    0   96

The following misclassified tweets are examples of such wrongly classified instances:

"Tacoma Fire Department replaces 3 fire engines with pickup trucks: TACOMA Cutbacks within the Tacoma Fire... http://t.co/jPe2kuKG" ( {} -> {F} )

"This girl is on fire. This girl is on fire. She's walking on fire. This girl is on fire - Alicia Keys #deep" ( {} -> {S} )

"NeoMemphis News: Massive fire at factory in Ripley: Action News 5 is on the scene of a factory fire at ... http://t.co/brfnVbWp #memphis" ( {F} -> {F,I} )

The examples show that certain words such as "fire" or digits in the message might lead to wrong classifications. This could be avoided by adding additional features or by using a larger training set.

In this section, we have first shown that a simple keyword-based classification approach is not suitable for multi-label classification. Second, we presented results of state-of-the-art multi-label classification approaches and showed that these perform quite well for classifying incident-related tweets. Compared to current approaches for the classification of microblogs, which rely on assigning only one label to an instance, the results show that it is possible to infer important situational information with only one classification step. The results also indicate that the label sequence has an influence on the classification performance; thus, this factor should be taken into account by subsequent approaches.

6. CONCLUSION

In this paper, we have shown how to apply multi-label learning on social media data for the classification of incident-related tweets. We showed that we are able to identify multiple labels with an exact match of 84.35%. This is an important finding, as multiple labels assigned in one classification step provide important information about the situation at hand, which could not be easily derived with previously used multi-class classification approaches. Furthermore, we have shown that the natural relation between labels, which represents for instance the relation between incidents and injuries in the real world, can be used and exploited by classification approaches in order to obtain better results.

For future work, we aim to add costs to our classifications. For instance, not detecting incident labels should be penalized heavily compared to misclassifying the incident type. Furthermore, we aim to improve the overall performance of our approach by taking different features and a larger training set into account.

Acknowledgements

This work has been partly funded by the German Federal Ministry for Education and Research (BMBF, 01|S12054).
References

[1] J. Daxenberger and I. Gurevych. A corpus-based study of edit categories in featured and non-featured Wikipedia articles. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 711–726, Dec. 2012.

[2] K. Dembczynski, W. Waegeman, W. Cheng, and E. Hüllermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5–45, 2012.

[3] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, Aug. 2008.

[4] R. Goolsby. Lifting Elephants: Twitter and Blogging in Global Perspective. In Social Computing and Behavioral Modeling. 2009.

[5] P. J. Hayes and S. P. Weinstein. CONSTRUE/TIS: A system for content-based indexing of a database of news stories. In A. T. Rappaport and R. G. Smith, editors, Proceedings of the 2nd Conference on Innovative Applications of Artificial Intelligence (IAAI-90), May 1-3, 1990, Washington, DC, USA, pages 49–64. AAAI Press, 1991.

[6] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec and C. Rouveirol, editors, Proceedings of the 10th European Conference on Machine Learning (ECML-98), pages 137–142, Chemnitz, Germany, 1998. Springer-Verlag.

[7] I. Katakis, G. Tsoumakas, and I. P. Vlahavas. Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD-08 Workshop on Discovery Challenge, Antwerp, Belgium, 2008.

[8] D. D. Lewis. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 37–50, 1992.

[9] D. D. Lewis. Reuters-21578 text categorization test collection distribution 1.0. README file (V 1.3), May 2004.

[10] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.

[11] E. Loza Mencía and J. Fürnkranz. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Proceedings of ECML-PKDD 2008, volume 5212 of LNCS, pages 50–65, Antwerp, Belgium, 2008. Springer.

[12] C. D. Manning, P. Raghavan, and H. Schütze. An Introduction to Information Retrieval, pages 117–120. Cambridge University Press, 2009.

[13] A. McCallum. Multi-label text classification with a mixture model trained by EM. In AAAI'99 Workshop on Text Learning, pages 1–7, 1999.

[14] A. Montejo Ráez, L. A. Ureña López, and R. Steinberger. Adaptive selection of base classifiers in one-against-all learning for large multi-labeled collections. In Advances in Natural Language Processing, 4th International Conference (EsTAL 2004), Alicante, Spain, October 20-22, Proceedings, volume 3230 of Lecture Notes in Computer Science, pages 1–12. Springer, 2004.
[15] O. Okolloh. Ushahidi, or 'testimony': Web 2.0 tools for crowdsourcing crisis information. Participatory Learning and Action, 59(January):65–70, 2008.

[16] J. P. Pestian, C. Brew, P. Matykiewicz, D. Hovermale, N. Johnson, K. B. Cohen, and W. Duch. A shared task involving multi-label classification of clinical free text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, pages 97–104. Association for Computational Linguistics, June 2007.

[17] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, June 2011.

[18] T. N. Rubin, A. Chambers, P. Smyth, and M. Steyvers. Statistical topic models for multi-label document classification. Machine Learning, 88(1-2):157–208, 2012.

[19] H. Sajnani, S. Javanmardi, D. W. McDonald, and C. V. Lopes. Multi-label classification of short text: A study on Wikipedia barnstars. In Analyzing Microtext, 2011.

[20] C. Sanden and J. Z. Zhang. Enhancing multi-label music genre classification through ensemble techniques. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 705–714. ACM, 2011.

[21] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2-3):135–168, 2000.

[22] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.

[23] A. Schulz, P. Ristoski, and H. Paulheim. I see a car crash: Real-time detection of small scale incidents in microblogs. In P. Cimiano, M. Fernàndez, V. Lopez, S. Schlobach, and J. Völker, editors, The Semantic Web: ESWC 2013 Satellite Events, number 7955 in Lecture Notes in Computer Science, pages 22–33. Springer Berlin Heidelberg, 2013.

[24] A. Schulz, T. D. Thanh, H. Paulheim, and I. Schweizer. A fine-grained sentiment analysis approach for detecting crisis related microposts. In Proceedings of the 10th International Conference on Information Systems for Crisis Response and Management (ISCRAM), pages 846–851, May 2013.

[25] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, Mar. 2002.

[26] G. Tsoumakas, I. Katakis, and I. P. Vlahavas. Effective and efficient multilabel classification in domains with large number of labels. In Proceedings of the ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD'08), 2008.

[27] G. Tsoumakas, I. Katakis, and I. P. Vlahavas. Mining multi-label data. In O. Maimon and L. Rokach, editors, Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer, 2010.

[28] G. Tsoumakas, E. Spyromitros Xioufis, J. Vilcek, and I. P. Vlahavas. Mulan: A Java library for multi-label learning. Journal of Machine Learning Research, 12:2411–2414, 2011. Software available at http://mulan.sourceforge.net/.

[29] S. Vieweg, A. L. Hughes, K. Starbird, and L. Palen. Microblogging during two natural hazards events: What Twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI'10), 2010.