<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilabel Classification Evaluation using Ontology Information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefanie Nowak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanna Lukashevich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Digital Media Technology IDMT Ehrenbergstrasse 31</institution>
          ,
          <addr-line>98693 Ilmenau</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Multilabel classification using ontology information is an emerging research area that combines machine learning methods with knowledge models. The performance assessment of such classification systems poses new challenges. We propose an evaluation measure that considers the mapping of label sets to their groundtruth and allows for the incorporation of real world knowledge. A distance-based measure from the area of hierarchical unilabel classification evaluation is extended to the case of multilabel classification and enriched with additional ontology information. The evaluation measure considers structure information, relationships and the agreement between annotators.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>2/14
1/14
1/14
1/14
4/14
4/14</p>
      <p>
        4/14
hasPersons
1/14
2/14
2/14
Many researchers regard the results of a multilabel classification system
similar to the unilabel classification approach. The prediction is evaluated for each
concept in isolation. This allows to use the well-known evaluation measures like
precision, recall, f-measure or accuracy. E.g., Fan et al. use the accuracy in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
to determine the quality of the classifier for each concept. The opposite way is
to start with the multimedia document and evaluate if all concepts are assigned
correctly. Then instead of comparing a single predicted label to a single
groundtruth label, one needs to compare two sets of labels. As a result, the predicted
labels can be fully correct (label sets are identical), fully wrong (the intersection
of the sets is empty), or partly correct (the sets have common labels, but are
not fully identical). The Accuracy that is often used as measure in traditional
classification evaluation cannot be used to judge partial matches, because it only
regards one instance as correctly classified if all associated labels were correctly
predicted. Partial matches are e.g., considered by utilizing the macro-average
and micro-average f-measures, proposed by Tague in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Shen et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed
an α-evaluation and multilabel class recall and precision. α-evaluation generates
a score while taking the ground-truth, predicted labels, missed labels and false
positive labels into account. Moreover, false positives and missed labels can be
penalized differently as it is more suitable for the particular application. The
parameter α introduces the so-called forgiveness rate as a trade-off between the
fully correct and partly correct prediction. Sun et al. propose in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
semanticbased misclassification costs. Each class of documents is represented by a feature
vector of all documents belonging to a certain category. The cosine distance
between feature vectors of two categories is used as similarity measure and defines
the average category similarity.
      </p>
      <p>
        A hierarchical organization of concepts enables a hierarchical evaluation.
Different hierarchical measures for unilabel classification are summarized in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Intuitively the concepts, that are located near in a hierarchy are more similar than
the ones that are located far. The idea is to judge an annotation from the
predictor that does not match exactly to the groundtruth by their distance in the
hierarchy. The most important measures are the depth independent distance-based
misclassification costs and the depth dependent distance-based misclassification
costs. In the former case, the predicted concept is compared to the correct one
and the number of edges of the shortest path in the hierarchy between both are
counted. In the latter case, an additional weight is assigned to each edge in the
hierarchy. So, misclassifications in deeper levels of the hierarchy have lower costs
than at an upper level. An evaluation measure for hierarchical
multiclassification evaluation, is proposed by Blockeel et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. They extrapolate distances
between individual labels to distances between sets of labels by mapping the
feature vectors of the sets into Euclidean space where the individual labels form
the base vectors. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] a hierarchical loss function is proposed that considers
classification into a hierarchy with multiple and partial paths. The first wrongly
classified node is regarded as mistake and adds to the loss while sub-nodes after
a mistake are not considered. Underlying is the assumption that for each
classification a path from root to leaf or from root to an internal node is present.
They compare their work to the zero-one loss and symmetric-difference loss.
3
3.1
      </p>
    </sec>
    <sec id="sec-2">
      <title>Evaluation Measure</title>
      <sec id="sec-2-1">
        <title>Requirements on the Evaluation Measure</title>
        <p>The proposed evaluation measure is utilized in a benchmark for multilabel
annotation of photos.1 The participant’s task is to annotate photos with 53 concepts
in a multilabel annotation scenario. A small ontology of the concepts is
provided that may be used for training the classifiers. To objectively compare the
approaches of the participants, different conditions have to be assured.</p>
        <p>First, the evaluation measure should consider partial matches and deliver
an annotation score for each image. Second, it has to be assured that a system
that annotates plants instead of trees is judged better than a system that
annotates mountains (see Fig. 1). This issue is considered in the depth
dependent distance-based misclassification costs, introduced in Sec. 2 for the unilabel
annotation task. In a multilabel annotation scenario, the challenge is how to
map the predicted label set to the groundtruth set. Third, the groundtruthing
of the concepts depicted in an image, is no easy task, also not for humans. Some
concept assignments are not objective and last in long discussions between the
annotators. Therefore a study about the agreement on concepts between
different annotators was conducted. Altogether 10 annotators annotated the same
100 images on their own. The degree of agreement for each concept over all
photos was computed and should serve as a probability if the groundtruthing
1 http://www.imageclef.org/2009/PhotoAnnotation
was correct. E.g., for the concept clouds in 96% the annotators agreed on
annotating the concept. In case of Aesthetic Image the annotators agreed only
at 75%. These empirically obtained values are used as weighting factors for the
calculated costs in case of misclassification. So, the more subjective concepts are
weighted less than the more objective ones. Fourth, the relationships defined in
the ontology should be kept. There should be an option to penalize a system,
if it simultaneously annotates concepts that are defined as disjoint or ignores
preconditions for relationships.</p>
        <p>None of the described evaluation measures in Sec. 2 fulfils these requirements.
E.g., the hierarchical loss measure assumes that every node of the hierarchy
possibly has annotation instances and that a continuous path exists through the
hierarchy. In the ontology for image annotation, abstract nodes are defined that
represent real-world knowledge but no visual concept. An example is the node
representation. A portrait is a visual subconcept of representation and
can be annotated by a classification system, but the concept representation
itself is no visual concept.
3.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Evaluation Measure for Multiannotation Scenarios</title>
        <p>Let us consider the predicted set of labels as P and the groundtruth set of labels
as G. Each set contains labels li respectively lj that are assigned to a multimedia
document X. First, the false positive labels P 0 = P \ (P ∩ G) and the missed
labels G0 = G \ (P ∩ G) are computed. Please note that |P 0| + |G0| ≤ |P ∪ G| is
always valid, because the number of false positive and missed labels can never be
greater than the number of the union of labels in both sets. Next, for each label
li from P 0 a match to a label lj from G is calculated and for each label lj from
G0 a mapping to a label li from P is performed in an optimization procedure
(see Eq. 1). The costs between two labels li and lj depend on the shortest path
in the hierarchy between both concepts. Each link is associated with a cost that
is cut in halves for each deeper level of the tree and is maximal equal to 1 for a
path between two leaf nodes of the deepest level (see Fig. 1). The costs ci for a
2(i−1)
link are calculated as ci = 2 PN</p>
        <p>
          · i=1 2(i−1) with N as the number of links from the
deepest node to the root. If P = ∅, the matching costs for all labels lj of G0 = G
are set to the maximum. The matching costs are computed as follows:
match(P, G) = X
li∈P 0
(min cost(li, lj)) · a(lj∗) + X
lj∈G lj∈G0
(min cost(li, lj)) · a(lj) (1)
li∈P
with lj∗ = argminlj∈G(cost(li, lj)).
a(lj) determines the annotation agreement factor for a concept lj and ranges
between [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ]. It is empirically determined through the agreement of different
annotators on a concept, as described in 3.1. Optionally, a crosscheck on the
predicted label set P is performed during computing match(P, G). If labels in
P violate relationships from the ontology, these labels get the maximum costs
of 1 as penalty assigned instead of calculating the minimal costs to a label of
G. Referring to the example in Fig. 1, a penalty of 1 is assigned if the system
simultaneously annotates single and small group or if portrait is annotated
without annotating one of the person concepts. Assuming that single is correct in
the first example, the costs between small group from P 0 and single from G are
equal 1 instead of only equal 2/14 · a(single) because of the hierarchy distances.
score(X) =
1 −
|P ∪ G|
The final score for each multimedia document X is based on the matching costs
between P and G divided by the number of different concepts in both label sets
(see Eq. 2). The score is 1 if all concepts are correctly annotated and goes to 0 if
no concept was found. Additionally, Shens α-factor, (α ≥ 0), introduced in Sec.
2, is incorporated to weight the strictness of the score regarding fully and partly
correct annotations, depending on the application demands.
match(P, G) α
(2)
4
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we propose an evaluation measure for the performance assessment
of multiannotation classification systems incorporating ontology knowledge. A
distance-based misclassification cost was extended from the unilabel to the
multilabel case and further enriched with ontology information like its hierarchy,
an annotation agreement factor and penalties for ignoring relationships. Next,
an extensive evaluation of the behaviour of the measure in a real benchmarking
scenario will be conducted. In future work, we would like to investigate how
relationships in ontologies can be incorporated more differentiated in dependence
from the evaluation scenario. Another point is how to base the evaluation
measure on semantic similarity of concepts instead of distances in hierarchies, as the
structure of the ontology is rather subjective and may change during time.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
          </string-name>
          , R.:
          <article-title>Mining multilevel image semantics via hierarchical classification</article-title>
          .
          <source>IEEE Trans. on Multimedia</source>
          <volume>10</volume>
          (
          <issue>2</issue>
          ) (
          <year>2008</year>
          )
          <fpage>167</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Tague</surname>
            ,
            <given-names>J.M.:</given-names>
          </string-name>
          <article-title>Information retrieval experiment</article-title>
          . In Jones, K.S., ed.:
          <article-title>The pragmatics of information retrieval experimentation</article-title>
          , Butterworths, London. (
          <year>1981</year>
          )
          <fpage>59</fpage>
          -
          <lpage>102</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boutell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
          </string-name>
          , C.
          <article-title>: Multi-label machine learning and its application to semantic scene classification</article-title>
          .
          <source>In: Intern. Symp. on Elec. Imag</source>
          . (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lim</surname>
          </string-name>
          , E.:
          <article-title>Hierarchical text classification and evaluation</article-title>
          .
          <source>In: Proc. of the IEEE Intern. Conf. on Data Mining</source>
          . Volume
          <volume>528</volume>
          ., California, USA (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Freitas</surname>
          </string-name>
          , A.,
          <string-name>
            <surname>de Carvalho</surname>
          </string-name>
          , A.:
          <article-title>A tutorial on hierarchical classification with applications in bioinformatics</article-title>
          .
          <source>Intelligent Information Technologies: Concepts</source>
          ,
          <source>Methodologies, Tools and Applications</source>
          (
          <year>2007</year>
          )
          <fpage>114</fpage>
          -
          <lpage>140</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Blockeel</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bruynooghe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Džeroski</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Struyf</surname>
          </string-name>
          , J.:
          <article-title>Hierarchical multi-classification</article-title>
          .
          <source>In: Proc. of Workshop on Multi-Relational Data Mining</source>
          . (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Cesa-Bianchi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gentile</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaniboni</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Incremental algorithms for hierarchical classification</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>7</volume>
          (
          <year>2006</year>
          )
          <fpage>31</fpage>
          -
          <lpage>54</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>