=Paper= {{Paper |id=Vol-1141/paper_01 |storemode=property |title=Evaluating Multi-label Classification of Incident-related Tweets |pdfUrl=https://ceur-ws.org/Vol-1141/paper_01.pdf |volume=Vol-1141 |dblpUrl=https://dblp.org/rec/conf/msm/SchulzMDS14 }} ==Evaluating Multi-label Classification of Incident-related Tweets== https://ceur-ws.org/Vol-1141/paper_01.pdf
      Evaluating Multi-label Classification of Incident-related
                              Tweets

Axel Schulz*+        Eneldo Loza Mencía+        Thanh Tung Dang†        Benedikt Schmidt*

* Telecooperation Lab, Technische Universität Darmstadt, Germany
+ Knowledge Engineering Group, Technische Universität Darmstadt, Germany
† HCI Research, SAP AG, Darmstadt, Germany

{aschulz,benedikt.schmidt}@tk.informatik.tu-darmstadt.de, eneldo@ke.tu-darmstadt.de, thanh.tung.dang@sap.com


ABSTRACT

Microblogs are an important source of information in emergency management, as lots of situational information is shared, both by citizens and official sources. It has been shown that incident-related information can be identified in the huge amount of available information using machine learning. Nevertheless, the currently used classification techniques only assign a single label to a micropost, resulting in a loss of important information that would be valuable for crisis management.

With this paper we contribute the first in-depth analysis of multi-label classification of incident-related tweets. We present an approach assigning multiple labels to these messages, providing additional information about the situation at hand. An evaluation shows that multi-label classification is applicable for detecting multiple labels with an exact match of 84.35%. Thus, it is a valuable means for classifying incident-related tweets. Furthermore, we show that correlation between labels can be taken into account for these kinds of classification tasks.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering; I.2.6 [Artificial Intelligence]: Learning

General Terms

Algorithms, Experimentation

Keywords

Microblogs, Multi-label Learning, Social Media

Copyright © 2014 held by author(s)/owner(s); copying permitted only for private and academic purposes. Published as part of the #Microposts2014 Workshop proceedings, available online as CEUR Vol-1141 (http://ceur-ws.org/Vol-1141). #Microposts2014, April 7th, 2014, Seoul, Korea.
Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media. WWW'14 Companion, April 7–11, 2014, Seoul, Korea. ACM 978-1-4503-2745-9/14/04.

1. INTRODUCTION

Social media platforms are widely used by citizens for sharing information covering personal opinions about various topics (e.g., politics) as well as information about events such as incidents. In the latter case, citizens act as observers and create valuable incident-related information. For instance, during incidents such as the Oklahoma grass fires and the Red River floods in April 2009 [29], or the terrorist attacks on Mumbai [4], useful situational information was shared on Twitter. Also, Ushahidi, a social platform used for crowd-based filtering of information [15], was heavily used during the Haitian earthquake for labeling crisis-related information.

However, the discovery of incident-related information is a complex task, requiring the separation of valuable information from daily chatter in the vast amount of information created on social platforms. This can be realized with techniques from data mining and machine learning. Classification is one method which can be utilized to extract relevant information from social networks (for tweets, see [23]). In a classification task, a system learns to label messages with exactly one label out of a predefined label set (e.g., "fire" or "crash"). This task is known as multi-class classification and is widely used for text classification. However, during our research we found that assigning only one label would result in the loss of important situational information for decision making in crisis management. For instance, consider the following tweet:

    THIS CAR HIT THE FIRE HYDRANT AND CAUGHT FIRE....SOMEONE HOLIDAY ALTERED

A single label would necessarily lack relevant information. A better approach is the concurrent assignment of all three labels, which is known as multi-label learning. In the example, all labels ("fire", "crash", and "injuries") would be assigned concurrently using an appropriate learning algorithm. The example also shows that the assignment of multiple labels is not necessarily an independent process. Once the label for an incident type such as "crash" is assigned, the probability of assigning the label "injuries" changes. This dependency is known as label correlation and needs to be investigated in the context of multi-label learning.

With our analysis we want to investigate three important aspects of applying multi-label learning to incident-related tweets: (1) how to apply multi-label learners to tweets, (2) whether the classification accuracy of multi-label classification approaches is comparable to the accuracy of multi-class classification approaches, and (3) whether correlation between labels is a factor that needs to be taken into account for incident-related information. With this paper we contribute the first




· #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014
in-depth analysis of multi-label classification of incident-related tweets. In summary, our contributions are twofold:

• We show that multi-label classification of incident-related tweets is applicable and able to detect the exact combination of labels in 84.35% of the cases. Thus, we show that, compared to common multi-class classification approaches, multi-label classification of incident-related tweets is a valuable means.

• We evaluate the influence of label correlation on the classification results for incident-related tweets. We show that label correlation needs to be taken into account for these classification tasks.

The remainder of the paper is organized as follows. First, we describe and discuss related approaches. Second, the considered multi-label classification algorithms as well as the technical infrastructure (a machine learning pipeline) used for the analysis are presented. Next, we introduce our data collection setup and describe the evaluation of our approach. We close with a conclusion and future work.

2. RELATED WORK

Techniques of multi-label classification have been applied to domains such as text categorization [21, 13], music genre detection [20], or tag recommendation [7]. These application domains address long texts, images, or audio information. Text is probably one of the oldest domains in which the demand for categorization appeared, particularly multi-label categorization [25], with the first multi-label dataset (Reuters-21578) used in machine learning research dating from the year 1987 [5, 8, 9]. Moreover, text data is easily accessible and processable as well as vastly available. Hence, text classification was also one of the first research fields for multi-label classification and continues to be the most represented one among the commonly available benchmark datasets.1

1 Cf. the http://mulan.sourceforge.net/datasets.html [28] and http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html repositories.

A common application for texts is the classification of news articles [10, 18], for which the research focuses on scalability issues regarding the number of articles and especially the number of labels a text can be assigned to, which can sometimes go up into the thousands [11, 26]. News texts, as well as abstracts from scientific papers [14] or radiology reports [16], may sometimes be relatively short, but they are usually still structured and homogeneous. These kinds of multi-label text classification problems have been analyzed very well in the past, and the approaches used have proven effective (we refer the interested reader to the cited recent works).

In contrast, texts such as tweets are mostly unstructured and noisy, because of their limitations in size and the often used colloquial language. Related work on such short texts with a focus on solving multi-class problems exists, e.g., for sentiment analysis [24] or incident detection and classification [23]. In contrast to these approaches, this paper focuses on the use of multi-label classification for tweets.

Applying multi-label learning to very short texts is a topic of open research. Only two respective examples are known to the authors: Sajnani et al. [19] and Daxenberger et al. [1]. Sajnani et al. provided a preliminary analysis of multi-label classification of Wikipedia barnstar texts. Barnstars can be awarded by Wikipedia authors and contain a short textual explanation why they have been awarded. In this case, labels for seven work domains have to be differentiated. The authors show which features can be extracted from short texts for multi-label classification and evaluate several multi-label classification approaches. Daxenberger et al. categorize individual edits into non-exclusive classes like vandalism, paraphrase, etc.

In summary, although many related approaches cope with multi-class classification of short texts such as microblogs, multi-label classification is an open research issue. Especially for the domain of crisis management, no prior research on this topic exists.

3. MULTI-LABEL CLASSIFICATION

In this section, we give an overview of multi-label classification. Multi-label classification refers to the task of learning a function that maps instances xi = (xi,1, ..., xi,a) ∈ X ⊆ R^a to label subsets or label vectors yi = (yi,1, ..., yi,n) ∈ {0, 1}^n, where L = {λ1, ..., λn}, n = |L|, is a finite set of predefined labels and each label attribute yi,j corresponds to the absence (0) or presence (1) of label λj. Thus, in contrast to multi-class classification, the alternatives are not assumed to be mutually exclusive, such that multiple labels may be associated with a single instance.

This makes multi-label data particularly interesting from the learning perspective, since, in contrast to binary or multi-class classification, there are label dependencies and interconnections in the data which can be detected and exploited in order to obtain additional useful information or simply better classification performance. Some examples for our particular Twitter dataset were already shown in the introduction. As we show, around 15% of our tweets could be assigned to more than one label; thus, we believe that it is not unusual to encounter tweets with several possible labels, so that in our opinion the view of microblogs as multi-labeled data seems more natural, more realistic, and more general. Nonetheless, previous work focuses on the multi-class labeling of tweets, and this is the first work known to the authors which tries to exploit label dependencies on tweets.

In the following, we describe commonly used approaches for multi-label classification: Binary Relevance (BR), Label Powerset (LP), and Classifier Chains (CC). All described techniques are based on the decomposition or transformation of the original multi-label problem into single-label binary problems, as most multi-label techniques do [27]. An illustration of these techniques is presented in Figure 1. This has the advantage that we can use state-of-the-art text classification algorithms, such as support vector machines [25, 6], for learning the binary problems. We will also have a closer look at each classification approach with respect to taking dependencies between labels into account. Two of the used approaches are specifically tailored to cope with such dependencies.

3.1 Binary Relevance

The most common approach for multi-label classification is to use an ensemble of binary classifiers, where each classifier predicts if an instance belongs to one specific class or not. The union of all classes that were predicted is taken




as the multi-label output. This approach is comparable to the classical one-against-all decomposition of a multi-class problem. Formally, we convert a training example pair (xi, yi) into n separate pairs (xi, yi,j), j = 1...n, one for each of the n base classifiers hj. The predicted labels ŷj for a test instance x are then the result of hj(x) ∈ {0, 1}.

Figure 1: Decomposition of multi-label training sets into multiclass (LP) or binary (BR, CC) problems: (a) the input training set of instances xi with label vectors (yi,1, ..., yi,n) ∈ {0, 1}^n; (b) the label powerset (LP) decomposition into one multiclass problem with classes σ(yi) ∈ {1, ..., 2^n}; (c) the binary relevance (BR) decomposition into n binary problems with classes yi,j ∈ {0, 1}; (d) the classifier chains (CC) decomposition with augmented instances x'i ∈ R^a × {0, 1}^(n-1). x'i denotes the augmented instance. During prediction, yi,1, yi,2, ... in the extended input space is replaced by the predictions of h^CC_1, h^CC_2, ... (see text).

This method is fast and simple; however, it is not able to take label dependencies into account, since each base classifier is trained independently from the other classifiers. As was recently stated by Dembczyński et al. [2], this is not necessarily a disadvantage if the objective is to obtain good label-wise predictions, such as measured by the Hamming loss (cf. Section 5). Therefore, BR serves as a fairly well performing baseline for our experiments.

3.2 Label Powerset

The basic idea of this algorithm is to transform the multi-label problem into a multi-class classification problem by considering each member of the powerset of labels in the training set as a single class. Hence, each training example is converted into (xi, σ(yi)), with σ, σ^-1 denoting a bijective function that maps between the label powerset of L and a set of 2^n meta-classes. The classifier hLP is trained, e.g., with one-against-all (as in our setting), and the prediction for x is obtained with σ^-1(hLP(x)).

LP takes label dependencies into account to some extent, as each distinct occurrence of a label pattern is treated as a new class. It is hence able to model the joint label distribution, but not explicitly and directly specific dependencies (correlations, implications, etc.) between labels. As a consequence, LP is tailored towards predicting exactly the correct label combination. As pointed out in [2], and contrary to what one may believe at first, this usually stands in contrast to predicting each label correctly individually (BR), i.e., we usually have a trade-off between both objectives.

In addition to the obvious computational cost problem due to the exponential growth of meta-labels, the sparsity of some label combinations, especially with an increasing number of labels, often causes some classes to contain only a few examples. This effect can also be observed in our data, cf. Table 2.

3.3 Classifier Chains

As stated before in Section 1, it is very likely in our dataset that injured people are mentioned when any incident type is also mentioned (200 of 967 cases). On the other hand, it seems almost a matter of course that there was an incident if there is an injured person. Although this only happens in 200 out of 232 cases in our data, we consider it relevant for larger data sets. The classifier chains approach (CC) of Read et al. [17] is able to directly capture such dependencies and has therefore become very popular recently.

The idea of this approach is to construct a chain of n binary classifiers h^CC_j, for which (in contrast to BR) each binary base classifier h^CC_j depends on the predictions of the previous classifiers h^CC_1, ..., h^CC_(j-1). In particular, we extend the feature space of the training instances for the base classifier h^CC_j to ((xi,1, ..., xi,a, yi,1, ..., yi,j-1), yi,j). Since the true labels yi are not known during prediction, CC uses the predictions of the preceding base classifiers instead. Hence, the unknown yj are replaced by the predictions ŷj = h^CC_j(x, ŷ1, ..., ŷj-1).

This reveals one problematic aspect of this approach, namely the order of the classifiers in the chain. Depending on the ordering, CC can only capture one direction of dependency between two labels. More specifically, CC can only capture the dependencies of yi on y1, ..., yi-1, but there is no possibility to consider dependencies of yi on yi+1, ..., yn. Recalling our example from the beginning, we can either learn the dependency of the label incident given injury or the other way around, but not both. In addition, the effect of error propagation caused by the chaining structure may also depend on the label permutation. We will evaluate the effect of choosing different orderings for our particular dataset later on in Section 5.3.

Furthermore, CC has advantages compared to LP. CC is geared towards predicting the correct label set, like LP [2], but unlike LP, CC is able to predict label combinations which were not seen beforehand in the training data. In addition, the imbalance between positive and negative training examples is generally lower than for LP.

4. MULTI-LABEL CLASSIFICATION OF INCIDENT-RELATED TWEETS

In the following, the data used for multi-label classification of incident-related tweets is described in detail. The approach taken is composed of three steps. As a first step, unstructured text has to be converted into structured text. As a second step, the structured information needs to be transformed into features that can be used by a multi-label learner. Third, these features are used to train and evaluate a classifier.

4.1 Preprocessing of Unstructured Text

Our overall goal is to apply text mining on short documents that are present in social media; thus, they need to be represented by a set of features. As texts in social media are mostly unstructured, they first need to be converted into a representation which enables feature generation. Hence, as a first step, we apply Natural Language Processing. Firstly, we remove all re-tweets, as these are just duplicates of other tweets and do not provide additional information. Secondly, @-mentions of Twitter users are removed from the tweet message, as we want to prevent overfitting towards certain user tokens. Before further processing is applied, the text is converted to Unicode, as some tweets contain non-Unicode characters. Third, abbreviations are resolved using a dictionary of abbreviations based on the data provided by the Internet Slang Dictionary&Translator2. Then, we identify and replace URLs with a common token "URL". As a next step, stopwords are removed. This is important, as very frequent words have limited influence when it comes to classifying tweets due to their relative frequency. Based on the resulting text, we conduct tokenization. Thus, the text is divided into discrete words (tokens) based on different delimiters such as white spaces. Every token is then analyzed and non-alphanumeric characters are removed or replaced. Also, lemmatization is applied to normalize all tokens. In addition to the common NLP processing steps, we identify and replace location mentions such as "Seattle" with a common token to allow semantic abstraction. For this, we use the approach presented in [23] to detect named entities referring to locations (so-called location mentions) in tweets and to replace them with the two tokens "LOC" and "PLACE".

4.2 Feature Generation

After finishing the initial preprocessing steps, we extracted several features from the tweets that are used for training a classifier. We conducted a comprehensive feature selection, analyzing the value of each feature for the overall classification performance. We compared word n-grams, character n-grams, and TF-IDF [12] scores as well as syntactic features such as the number of exclamation marks, question marks, and upper-case characters. We found that the following features are the most beneficial for our classification problems:

• Word 3-gram extraction: We extract word three-grams from the tweet message. Each 3-gram is represented by two attributes: one attribute indicating the presence of the 3-gram and another attribute indicating the frequency of the 3-gram.

• Sum of TF-IDF scores: For every document we calculate the accumulated TF-IDF (term-frequency inverse-document-frequency) score [12] based on the single TF-IDF scores of each term in the document. The rationale behind this is to create a similarity score which is not as strict as traditional TF-IDF scores, but allows forming clusters of similar documents.

• Syntactic features: the number of '!' and '?' in a tweet and the number of capitalized characters.

• Spatial features: As location mentions are replaced with a corresponding token, they appear as word unigrams in our model and can therefore be regarded as additional features.

4.3 Dataset

We focus on three different incident types throughout the paper in order to differentiate incident-related tweets. These three classes have been chosen because we identified them as the most common incident types using the Seattle Real Time Fire Calls dataset3, which is a frequently updated source of official incident information. We also included injury as an additional label. This results in four labels, consisting of very common and distinct incident types and the injury label: Fire, Shooting, Crash, and Injury.

We collected public tweets in the English language using the Twitter Search API, which provides geotagged tweets as well as tweets for which Twitter inferred a geolocation based on the user profile. For the collection, we used a 15 km radius around the city centers of Seattle, WA and Memphis, TN. We focused on only two cities, as for our analyses we are interested in the stream of tweets for these cities and a specific time period instead of a scattered sample of the world, which could be retrieved using the Twitter Streaming API. This gave us a set of 7.5M tweets collected from 11/19/12 to 02/07/13. Though we know about the limitations of the Search API, we think that we collected a relevant sample for our experiments.

The dataset was further reduced to be usable for high-quality labeling as well as the machine learning experiment. We first identified and extracted tweets mentioning incident-related keywords. Compared to other approaches that completely rely on filtering using hashtags, we take the whole message into account for identifying incident-related keywords. We retrieved a set of different incident types using the "Seattle Real Time Fire 911 Calls" dataset and defined one general keyword set with keywords that are used in all types of incidents, like 'incident', 'injury', 'police', etc. For each incident type, we further identified specific keywords. For instance, for the incident type 'Motor Vehicle Accident Freeway' we use the keywords 'vehicle', 'accident', and 'road'. Based on these words, we use WordNet4 to extend this set by adding the direct hyponyms. For instance, the keyword 'accident' was extended with 'collision', 'crash', 'wreck', 'injury', 'fatal accident', and 'casualty'. Based on these incident-related keywords, we filtered the datasets. Furthermore, we removed all re-tweets, as the original tweets are also contained in our datasets and only these are needed for our experiments. Based on this filtered dataset, we randomly selected 20,000 tweets.

The selected tweets have been labeled manually by one researcher of our department. Out of these tweets, we ran-
      • Syntactic features: Along with the features directly
                                                                   domly selected 2.000 tweets for further re-labeling for our
        extracted from a tweet, several syntactic features are
                                                                   multi-label classification problem. Those tweets were manu-
        expected to improve the performance of our approach.
                                                                   ally examined by five researchers using an online survey. To
        People might tend to use a lot of punctuations, such
                                                                   assign the final coding, we differentiated between two types
        as explanation marks and question marks, or a lot of
                                                                   of agreement:
        capitalized letters when they are reporting some inci-
        dent. In this case, we extract the following features:     3
                                                                       http://data.seattle.gov
 2                                                                 4
     http://www.noslang.com                                            http://wordnet.princeton.edu
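The feature set described in Section 4.2 (word 3-grams with a presence and a frequency attribute, a summed TF-IDF score, and syntactic counts) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation; all function and attribute names are our own.

```python
import math
from collections import Counter


def word_ngrams(text, n=3):
    """Extract word n-grams (here: 3-grams) from a tweet."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, vocabulary, n=3):
    """Two attributes per 3-gram: presence (0/1) and frequency."""
    counts = Counter(word_ngrams(text, n))
    features = {}
    for gram in vocabulary:
        features[f"has({gram})"] = int(counts[gram] > 0)
        features[f"freq({gram})"] = counts[gram]
    return features


def tfidf_sum(text, corpus):
    """Accumulated TF-IDF score of one document over a reference corpus."""
    docs = [doc.lower().split() for doc in corpus]
    score = 0.0
    for term, freq in Counter(text.lower().split()).items():
        df = sum(1 for doc in docs if term in doc)     # document frequency
        idf = math.log(len(docs) / (1 + df))           # smoothed IDF
        score += freq * idf
    return score


def syntactic_features(text):
    """Counts of exclamation marks, question marks and upper-case characters."""
    return {
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "uppercase": sum(c.isupper() for c in text),
    }
```

The vocabulary of 3-grams would in practice be collected from the training set; the dictionary-of-attributes representation maps directly onto a sparse feature vector for an SVM.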




· #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014
Table 1: Overview of real-world incident types used for extraction of incident-related keywords and the number of extracted keywords for the keyword-based classification approach.

  Class      Real-World Incident Types                          # of Keywords
  Fire       Fire In Building, Fire In Single Family Res,            148
             Automatic Fire Alarm Resd, Auto Fire Alarm
  Shooting   Assault w/Weap, Assault w/Weapons Aid                    36
  Crash      Motor Vehicle Accident, Motor Vehicle Accident           73
             Freeway, Medic Response Freeway, Car Fire,
             Car Fire Freeway
  Injury     -                                                        23
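The keyword-based baseline amounts to assigning every class whose keyword set intersects the message. A minimal sketch of that rule; the keyword sets shown are small illustrative excerpts of our own choosing, not the full sets of 148/36/73/23 keywords:

```python
# Illustrative excerpts of the per-class keyword sets (the real sets
# contain 148, 36, 73 and 23 keywords, respectively).
KEYWORDS = {
    "Fire": {"fire", "blaze", "smoke"},
    "Shooting": {"shooting", "shot", "gunfire"},
    "Crash": {"crash", "collision", "accident"},
    "Injury": {"injury", "injured", "casualty"},
}


def keyword_labels(tweet):
    """Assign every label whose keyword set intersects the tokenized tweet."""
    tokens = set(tweet.lower().split())
    return {label for label, words in KEYWORDS.items() if tokens & words}
```

Because a tweet may match several keyword sets, this baseline naturally produces label sets, which is what Section 5.2 evaluates against the learned multi-label models.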


   • if four out of five coders agree on one label, only this label is assigned;

   • if fewer than four coders agree on one label, all labels which at least two coders assumed to be correct are assigned as possible labels and further verified in a group discussion.

   The final labeled dataset consists of 10 different label combinations. The distribution of every combination is outlined in Table 2. The distribution indicates that around 15% (232) of all tweets in our dataset have been labeled with multiple labels. Another observation is that almost exactly 50% of the tweets do not have any label assigned, which is rather unusual compared to typically used and analyzed multi-label datasets5. In addition, the label cardinality, i.e., the average number of labels assigned to an instance, is around 0.59, whereas common datasets have a cardinality of more than 1. On the other hand, this is mainly due to the low number of total labels, since the label density (the average percentage of labels which are true) is 15%, which is a relatively high value. From a multi-label learning perspective, this is an interesting property of this dataset, since it is not clear how commonly used techniques will behave under this circumstance. For example, many algorithms ignore instances without any label given.

Table 2: Distribution of the 10 label combinations occurring in the 2,000 tweets of the dataset.
       Label Combination        Number of Tweets
       {}                             971
       {Fire}                         313
       {Shooting}                     184
       {Crash}                        268
       {Injury}                        32
       {Crash, Fire}                    2
       {Injury, Crash}                 47
       {Injury, Shooting}             149
       {Injury, Fire}                  33
       {Injury, Fire, Crash}            1

5 We refer to the repository at http://mulan.sourceforge.net/datasets.html for an overview of the statistics of the commonly used benchmark datasets in multi-label classification.

5.    EVALUATION
   In the following section, we provide the evaluation results for the presented multi-label classification approaches on our dataset. We also present the results for a keyword-based approach as a simple way of conducting multi-label classification.

5.1    Evaluation Setup
   We performed our experiments with Mulan, an open-source library for multi-label classification based on Weka [28]. We used two learners for our evaluation. First, we use the LibLinear implementation of support vector machines with a linear kernel [3] as our base learner. We use the default settings, as we found that additional parameter optimization was not beneficial for improving the overall classification results. Second, we used the Weka implementation of Naive Bayes. The results were obtained using 10-fold cross-validation.
   The evaluation of multi-label problems requires different measures compared to those used for multi-class problems. In our paper, we use the following metrics:
   Exact Match: Exact match is the percentage of the m test instances for which the label sets were exactly correctly classified (with [[z]] as an indicator function returning 1 if z is true, otherwise 0):

      ExactMatch(h) = \frac{1}{m} \sum_{i=1}^{m} [[y_i = h(x_i)]]          (1)

   Hamming Loss: The instance-wise Hamming loss [22] is defined as the percentage of wrong or missed labels compared to the total number of labels in the dataset. It takes into account both that an incorrect label is predicted and that a relevant label is not predicted. As this is a loss function, the optimal value is zero.
   Recall, Precision and F1: We use micro-averaged precision and recall measures to evaluate our results, i.e., we compute a two-class confusion matrix for each label (y_i = 1 vs. y_i = 0) and eventually aggregate the results by (component-wise) summing up all n matrices into one global confusion matrix (cf. [27]). Recall and precision are computed based on this global matrix in the usual way; F1 denotes the unweighted harmonic mean between precision and recall. In Section 5.3, we also report recall, precision, and F1 for each label using the label-wise confusion matrices.

5.2    Results for Keyword-Based Filtering
   As mentioned before, we use keyword-based pre-filtering for selecting an initial set of tweets that is suitable for labeling. A first and simple approach for detecting incident-related tweets is to use these keywords for classification.
   In Table 1, the real-world incident types from the Seattle Real Time Fire 911 Calls dataset and the corresponding numbers of extracted keywords are shown. For the injury class, no specific type in the Seattle dataset could be found; thus, we extended the set with a manually created list of keywords and their direct hyponyms.
   The results for classifying each individual class are shown in Table 3. The results indicate that precision as well as recall are rather low. Only for the fire class could a high recall be achieved.

Table 3: Precision and recall for each individual label when applying keyword-based classification.
                Shooting     Fire      Crash     Injury
   Precision     31.59%     54.12%    15.04%     63.64%
   Recall        68.77%     95.99%    49.37%     37.40%

   Furthermore, if the keywords are used for multi-label classification, a precision of 32.22% and a recall of 64.90% are achieved, which is a rather poor result. Exact match (28.45%) and Hamming loss (27.08%) are also poor; thus, we conclude that with simple keyword-based filtering, multi-label classification cannot be done accurately.

5.3    Results for Multi-Label Classification
   As a first step, we addressed the question of whether correlation between labels is taken into account and is beneficial for the classification results. Thus, we evaluated all different label sequences using the classifier chains algorithm for our labels Fire (F), Shooting (S), Crash (C), and Injury (I). The values of exact match for each sequence are shown in Figure 2 (using SVM as our base learner).

Figure 2: Percentages of exact matches for all label combinations. (Plot of exact match, ranging from about 82.6% to 84.6%, over the label sequences.)

   The results indicate that the label sequence indeed has an influence on the classification performance. In our case, we get a difference of 1% between the best sequence Shooting, Crash, Fire, Injury and the worst sequence Injury, Crash, Fire, Shooting. Also, we see that the Injury label is best used after the incident labels have been classified - for the best cases even as one of the last labels in the sequence. It is also remarkable that classifying Shooting as the first label, followed by either Crash or Fire, is always a good option. This can be explained on the one hand by the generally good individual prediction performance for Shooting (cf. Table 5), hence leading to low error propagation, and on the other hand by the resulting label dependencies given that the Shooting label is known: for instance, we can see from Table 2 that we can safely exclude Crash or Fire if there was a Shooting. This confirms our initial assumption that correlation between labels needs to be taken into account.
   Based on the respective best (MAX) and worst (MIN) sequences, we compared CC to the multi-label approaches with the two different base learners. These evaluation results are shown in Table 4. The first observation is that Naive Bayes is not adequate for classifying tweets: though it achieves the best recall values using CC, this comes at the expense of very low results on the remaining metrics and approaches. We will therefore focus on the results obtained by applying LibLinear as the base learner. The results show that, if there is the opportunity to pre-optimize the ordering of the labels, e.g., by performing a cross-validation on the training data, then classifier chains are able to slightly outperform the other approaches, most likely because the label correlation is valuable. This is also reflected in the good performance with respect to exact match, where even the worst CC outperforms LP, which is particularly tailored towards matching the exact label combination. Note also that LP is a common approach for circumventing the need for multi-label classification by creating meta-classes, as already mentioned in the introduction. However, this approach is always inferior to the compared approaches, which demonstrates the need for more advanced techniques in this particular use case.
   We can also observe that improving the prediction of the exact label combinations may come at the expense of reduced performance on the label-wise measures, since the additional features used by CC generally lead to a higher potential deterioration (MIN) than potential improvement (MAX) for Hamming loss, recall, precision, and F1, whereas for exact match this is not as clear.
   As a last evaluation step, we evaluated the accuracy of each approach for every individual label. This is important as we want to understand how well a classifier performs for each label. Table 5 depicts the accuracy of individual labels using SVM with the best label order.
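The classifier chains setup evaluated above can be sketched as follows. Each classifier in the chain sees the original feature vector extended by the predictions for all earlier labels in the chosen order; during training, the earlier labels come from the ground truth, as in Read et al. [17]. This is an illustrative minimal implementation, not the Mulan one, and the trivial majority-class base learner below merely stands in for LibLinear to keep the sketch self-contained.

```python
class MajorityLearner:
    """Trivial stand-in base learner: always predicts the majority class."""
    def fit(self, X, y):
        self.label = int(sum(y) * 2 >= len(y))
    def predict(self, x):
        return self.label


class ClassifierChain:
    """Minimal classifier chain: one binary classifier per label, each
    receiving the features plus the predictions of earlier labels."""

    def __init__(self, base_learner_factory, order):
        self.order = order  # e.g. ["Shooting", "Crash", "Fire", "Injury"]
        self.models = {label: base_learner_factory() for label in order}

    def fit(self, X, Y):
        # X: list of feature vectors; Y: dict label -> list of 0/1 targets.
        for i, label in enumerate(self.order):
            extended = [x + [Y[prev][j] for prev in self.order[:i]]
                        for j, x in enumerate(X)]
            self.models[label].fit(extended, Y[label])

    def predict(self, x):
        preds, labelset = [], {}
        for label in self.order:
            y = self.models[label].predict(x + preds)
            labelset[label] = y
            preds.append(y)  # feed this prediction to later classifiers
        return labelset
```

Evaluating all orderings, as done for Figure 2, amounts to instantiating `ClassifierChain` once per permutation of the four labels and comparing exact match scores.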
Table 4: Results for the different multi-label approaches and base learners obtained by cross-validation.

                            Naive Bayes                               SVM
                 BR       LP       CC-MIN   CC-MAX       BR       LP       CC-MIN   CC-MAX
  Exact Match    59.60%   66.95%   71.15%   72.45%       83.85%   83.05%   83.25%   84.35%
  H-Loss         15.02%   14.08%   9.400%   9.175%       4.688%   5.313%   4.900%   4.588%
  F1             52.19%   55.37%   72.90%   73.61%       83.55%   81.53%   82.80%   84.02%
  Precision      52.40%   55.34%   66.84%   67.92%       93.61%   90.28%   92.75%   93.46%
  Recall         51.98%   55.39%   79.63%   80.35%       75.44%   74.35%   74.72%   76.47%
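The measures reported in Table 4 (exact match, instance-wise Hamming loss, and micro-averaged precision, recall, and F1 from one global confusion matrix) can be computed as in this sketch; a minimal illustration with label sets as Python sets, not Mulan's implementation:

```python
def exact_match(true_sets, pred_sets):
    """Fraction of instances whose predicted label set equals the true one."""
    return sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)


def hamming_loss(true_sets, pred_sets, labels):
    """Fraction of per-label decisions (over all instances) that are wrong."""
    wrong = sum((label in t) != (label in p)
                for t, p in zip(true_sets, pred_sets) for label in labels)
    return wrong / (len(true_sets) * len(labels))


def micro_prf(true_sets, pred_sets, labels):
    """Micro-averaged precision, recall and F1 from a global confusion matrix."""
    tp = fp = fn = 0
    for t, p in zip(true_sets, pred_sets):
        for label in labels:
            if label in p and label in t:
                tp += 1
            elif label in p:
                fp += 1   # predicted but not relevant
            elif label in t:
                fn += 1   # relevant but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Note that exact match penalizes any deviation from the full label set, while Hamming loss and the micro-averaged measures credit partially correct predictions, which explains why the approaches rank differently across the columns of Table 4.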


Table 5: Precision and recall for each individual label.

                BR (SVM)            LP (SVM)            CC (SVM)
             Prec.    Recall     Prec.    Recall     Prec.    Recall
  Shooting   95.7%    79.3%      92.0%    76.9%      95.7%    79.3%
  Fire       94.7%    82.0%      90.3%    83.0%      93.3%    83.7%
  Crash      90.8%    77.4%      88.0%    78.3%      90.9%    78.3%
  Injury     92.9%    59.5%      91.1%    54.6%      93.0%    61.0%

   The results show that the precision for individual labels is high, with about 90% to 95% for each label, which is much better compared to the keyword-based classification. The differences between all approaches are nearly the same; thus, all approaches seem to be appropriate for classifying the individual labels. However, the recall drops significantly, depending on the label type. For instance, injuries often remain undetected. In this case, classifier chains show the best results for precision and recall. Note that the results for BR and CC on Shooting are the same, since the first classifier in the CC ordering is trained exactly like the corresponding BR classifier (cf. also Figure 1). This also shows that along the chain, CC slightly reduces the good precision of BR in exchange for improved recall.

5.4    Discussion
   Though the results show the advantage of multi-label classification, we want to understand the limitations of our approach. Thus, we first created a confusion matrix for the classifier chains approach with the best label order. The matrix shows that most misclassifications occur due to an assignment of instances to the "no incident" label combination {}. The other wrong classifications are mostly a result of not detecting the injury label or of predicting it wrongly.

Table 6: Confusion matrix. The rows indicate the predicted/true label combinations and the columns the true/predicted ones.
            ∅     F     C   F,C     I   F,I   C,I  F,C,I     S   F,S   I,S
      ∅   924    16    24     0     0     0     0      0     3     0     4
      F    49   261     0     0     0     3     0      0     0     0     0
      C    54     0   213     1     0     0     0      0     0     0     0
    F,C     1     1     0     0     0     0     0      0     0     0     0
      I    16     0     1     0    11     0     0      0     1     0     3
    F,I     5    10     0     0     1    17     0      0     0     0     0
    C,I     8     0    12     0     3     0    23      0     0     0     1
  F,C,I     1     0     0     0     0     0     0      0     0     0     0
      S    33     4     0     0     1     0     0      0   142     0     4
    F,S     0     0     0     0     0     0     0      0     0     0     0
    I,S    26     0     0     0     5     0     0      0    22     0    96

   The following misclassified tweets show examples of such wrongly classified instances:

      "Tacoma Fire Department replaces 3 fire engines with pickup trucks: TACOMA Cutbacks within the Tacoma Fire... http://t.co/jPe2kuKG" ( {} -> {F} )

      "This girl is on fire. This girl is on fire. She's walking on fire. This girl is on fire - Alicia Keys #deep" ( {} -> {S} )

      "NeoMemphis News: Massive fire at factory in Ripley: Action News 5 is on the scene of a factory fire at ... http://t.co/brfnVbWp #memphis" ( {F} -> {F,I} )

   The examples show that certain words such as "fire" or digits in the message might lead to wrong classifications. This could be avoided by adding additional features or by using a larger training set.
   In this section we have first shown that a simple keyword-based classification approach is not suitable for multi-label classification. Second, we presented results of state-of-the-art multi-label classification approaches and showed that these perform quite well for classifying incident-related tweets. Compared to current approaches for the classification of microblogs, which rely on assigning only one label to an instance, the results show that it is possible to infer important situational information with only one classification step. The results also indicate that the label sequence has an influence on the classification performance; thus, this factor should be taken into account by following approaches.

6.   CONCLUSION
   In this paper we have shown how to apply multi-label learning on social media data for the classification of incident-related tweets. Furthermore, we showed that we are able to identify multiple labels with an exact match of 84.35%. This is an important finding, as multiple labels assigned with one classification approach provide important information about the situation at hand, which could not easily be derived from previously used multi-class classification approaches. Furthermore, we have shown that the natural relation of labels, which represents for instance the relation between incidents and injuries in the real world, can be used and exploited by classification approaches in order to obtain better results.
   For future work, we aim to add costs to our classifications. For instance, not detecting incident labels should be penalized heavily compared to misclassifying the incident type. Furthermore, we aim to improve the overall performance of our approach by taking different features and a larger training set into account.
Acknowledgements
This work has been partly funded by the German Federal Ministry for Education and Research (BMBF, 01|S12054).

References
 [1] J. Daxenberger and I. Gurevych. A corpus-based study of edit categories in featured and non-featured Wikipedia articles. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 711–726, Dec. 2012.
 [2] K. Dembczynski, W. Waegeman, W. Cheng, and E. Hüllermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5–45, 2012.
 [3] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, Aug. 2008.
 [4] R. Goolsby. Lifting elephants: Twitter and blogging in global perspective. In Social Computing and Behavioral Modeling. 2009.
 [5] P. J. Hayes and S. P. Weinstein. CONSTRUE/TIS: A system for content-based indexing of a database of news stories. In A. T. Rappaport and R. G. Smith, editors, Proceedings of the 2nd Conference on Innovative Applications of Artificial Intelligence (IAAI-90), pages 49–64. AAAI Press, 1991.
 [6] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec and C. Rouveirol, editors, Proceedings of the 10th European Conference on Machine Learning (ECML-98), pages 137–142, Chemnitz, Germany, 1998. Springer-Verlag.
 [7] I. Katakis, G. Tsoumakas, and I. P. Vlahavas. Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD-08 Workshop on Discovery Challenge, Antwerp, Belgium, 2008.
 [8] D. D. Lewis. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 37–50, 1992.
 [9] D. D. Lewis. Reuters-21578 text categorization test collection distribution 1.0. README file (V 1.3), May 2004.
[10] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
[11] E. Loza Mencía and J. Fürnkranz. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Proc. ECML-PKDD-2008, volume 5212 of LNCS, pages 50–65, Antwerp, Belgium, 2008. Springer.
[12] C. D. Manning, P. Raghavan, and H. Schütze. An Introduction to Information Retrieval, pages 117–120. Cambridge University Press, 2009.
[13] A. McCallum. Multi-label text classification with a mixture model trained by EM. In AAAI'99 Workshop on Text Learning, 1999.
[15] O. Okolloh. Ushahidi, or 'testimony': Web 2.0 tools for crowdsourcing crisis information. Participatory Learning and Action, 59(January):65–70, 2008.
[16] J. P. Pestian, C. Brew, P. Matykiewicz, D. Hovermale, N. Johnson, K. B. Cohen, and W. Duch. A shared task involving multi-label classification of clinical free text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, pages 97–104. Association for Computational Linguistics, June 2007.
[17] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, June 2011.
[18] T. N. Rubin, A. Chambers, P. Smyth, and M. Steyvers. Statistical topic models for multi-label document classification. Machine Learning, 88(1-2):157–208, 2012.
[19] H. Sajnani, S. Javanmardi, D. W. McDonald, and C. V. Lopes. Multi-label classification of short text: A study on Wikipedia barnstars. In Analyzing Microtext, 2011.
[20] C. Sanden and J. Z. Zhang. Enhancing multi-label music genre classification through ensemble techniques. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 705–714. ACM, 2011.
[21] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2-3):135–168, 2000.
[22] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.
[23] A. Schulz, P. Ristoski, and H. Paulheim. I see a car crash: Real-time detection of small scale incidents in microblogs. In The Semantic Web: ESWC 2013 Satellite Events, number 7955 in Lecture Notes in Computer Science, pages 22–33. Springer Berlin Heidelberg, 2013.
[24] A. Schulz, T. D. Thanh, H. Paulheim, and I. Schweizer. A fine-grained sentiment analysis approach for detecting crisis related microposts. In Proceedings of the 10th International Conference on Information Systems for Crisis Response and Management (ISCRAM), pages 846–851, May 2013.
[25] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, Mar. 2002.
[26] G. Tsoumakas, I. Katakis, and I. P. Vlahavas. Effective and efficient multilabel classification in domains with large number of labels. In Proceedings of the ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD'08), 2008.
[27] G. Tsoumakas, I. Katakis, and I. P. Vlahavas. Mining multi-label data. In O. Maimon and L. Rokach, editors, Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer, 2010.
[28] G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. P. Vlahavas. Mulan: A Java library for multi-label learning. Journal of Machine Learning Research, 12:2411–2414, 2011. Software available at http://mulan.sourceforge.net/.
      ture model trained by EM. In AAAI’99 Workshop on Text
      Learning, pages 1–7, 1999.                                         [29] S. Vieweg, A. L. Hughes, K. Starbird, and L. Palen. Mi-
                                                                              croblogging during two natural hazards events: what twitter
 [14] A. Montejo Ráez, L. A. Ureña López, and R. Steinberger.              may contribute to situational awareness. In Proceedings of
      Adaptive selection of base classifiers in one-against-all learn-        the SIGCHI Conference on Human Factors in Computing
      ing for large multi-labeled collections. In Advances in Natu-           Systems Pages (CHI’10), 2010.
      ral Language Processing, 4th International Conference (Es-
      TAL 2004), Alicante, Spain, October 20-22, Proceedings,
      volume 3230 of Lecture Notes in Computer Science, pages
      1–12. Springer, 2004.




· #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014