<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Time to evaluate: Targeting Annotation Tools</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Peyman Sazedj</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>H. Sofia Pinto</string-name>
          <email>a.pinto@dei.ist.utl.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Instituto Superior Técnico, Departamento de Eng. Informática</institution>
          ,
          <addr-line>Av. Rovisco Pais, 1049-001 Lisboa</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <fpage>37</fpage>
      <lpage>48</lpage>
      <abstract>
        <p>One of the problems newcomers to the annotation area face is which tool to choose. In order to compare available tools one must evaluate them according to an evaluation framework. In this paper we propose an evaluation framework, that introduces general tool-focused criteria - interface and general - and annotation-focused criteria - metadata and procedure. We describe how the evaluation of a set of tools was performed and the results of such a comparison. Our results may not only be useful to newcomers, but also to experts who may use them to seize new opportunities for development.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>The term annotation designates either metadata or a procedure (that produces
metadata). Based on this definition, an annotation tool is a tool that allows
one to manipulate annotation metadata, operates an annotation procedure, or
both. Annotation metadata can have a textual, ontological or linguistic nature.
Annotations are associated with resources, which can take several forms: web pages,
images, video. Along these dimensions there is already a considerable wealth
of tools that fall into the annotation tool category. In our case we are interested
in ontological annotation tools that deal with web pages.</p>
      <p>
        Moreover, several other dimensions characterize and distinguish
annotation tools. Therefore, one of the problems newcomers to the
annotation area face is which tool to choose. Although some initial attempts to
describe some of the available tools have been made [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], these attempts aimed
only to list and describe some functionality issues (languages offered, kind of
annotation procedure supported, whether manual or automatic, etc.). However, due to
the vast array of different features provided by these tools, it is interesting
to rank them according to a given set of features.
      </p>
      <p>In order to compare available tools one must evaluate them according to an
evaluation framework. In this paper we propose an evaluation framework (Section 3)
that introduces domain-specific and domain-independent criteria. These criteria
range from general tool-focused criteria (interface and general) to
annotation-focused criteria (metadata and procedure). We describe how the evaluation of
a set of tools was performed (Section 4), present the results of such a comparison
(Section 5), and draw some initial conclusions from our findings (Section 6).</p>
    </sec>
    <sec id="sec-2">
      <title>Choosing the Tool Set</title>
      <p>In order to select among existing annotation tools, we chose two simple selection
criteria. The first criterion captures the requirement of providing explicit formal
meaning to annotations. Among the technologies we have surveyed so far,
only ontology-based tools assign semantics to their data. Thus, we were
only concerned with the evaluation of ontology-based annotation tools. The
diversity of resources available on the web led us to impose a second criterion.
Since the largest part of the world wide web is composed of web pages, and since
web pages are the main medium for hosting other types of resources, we were
mostly interested in the annotation of web pages.</p>
      <p>
        Based on these two criteria, a set of 5 tools was selected: Kim Plugin 1.05
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (Sirma Inc.), Melita 2.0 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (University of Sheffield), MnM 2.1 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] (KMI),
Ontomat 0.8 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (AIFB) and C-Pankow [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] (AIFB).1 Even though Ontomat is
supposed to support automatic annotation, for technical reasons we could
only operate it in manual mode. Melita and MnM are semi-automatic
annotation tools based on the Amilcare Information Extraction engine [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Kim and
C-Pankow are fully automatic and unsupervised tools. Kim is a semantic
information platform; only the Kim annotation plugin and its annotations
were evaluated. With regard to C-Pankow, we evaluated its web-based
implementation.2
      </p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation Framework</title>
      <p>We propose an evaluation framework for the evaluation of annotation tools,
based on a set of well-defined criteria. Our hope is to define each criterion as
clearly as possible, encouraging a rigorous and fair evaluation. For that purpose,
a metric is associated with each criterion, quantifying its measurement as
accurately as possible. Since perfect quantification rather exceeds our scope, our
metrics are mostly significant to a relative extent. In other words, the final
quantification of a tool can only be meaningful when compared to the quantification
of another tool, evaluated by the same framework according to the same set of
criteria. Moreover, we try to identify as many criteria as possible, representing
the diverse issues dealt with by the area of semantic annotation to a fair extent.</p>
      <p>We propose a set of 20 criteria which are classified into four dimensions:
general, interface, metadata and procedure. We argue that any criterion is either
domain-independent or domain-specific. In turn, domain-independent criteria are
divided into interface and general criteria, whereas domain-specific criteria are
divided into metadata and procedure criteria. Even though some of our interface
criteria may seem to have domain-specific metrics, the criteria themselves
are essentially domain-independent and their metrics can be adapted to
any domain of choice.</p>
      <p>Based on their metrics, our criteria may be classified into three types:
1. Feature-based criteria
2. Set-dependent criteria
3. Set-independent criteria</p>
      <p>1 Smore 5.0 was not evaluated due to lack of support of annotation metadata.
2 http://km.aifb.uni-karlsruhe.de/services/pankow/annotation/</p>
      <p>Feature-based criteria are those which verify the existence of specific features
or functionalities. Most of our domain-specific criteria are of this type.
Set-dependent criteria are those which depend on some property inherent to the
whole set of tools under evaluation. In other words, the same tool may score
differently if evaluated within distinct sets of tools. Set-independent criteria are
based on mathematical metrics which are independent of the tool set.</p>
      <p>In table 1 we present a list of all criteria. Feature-based criteria are indicated
with (f), while set-dependent criteria are marked with (d). Criteria with no marks
are set-independent. The asterisk indicates that a criterion is only applicable to
tools with support for automatic annotation.</p>
      <p>Since we classified our criteria into one of three types, we present a detailed
example of each type within the remainder of this section. Table 2 summarizes
our domain-specific criteria. Due to lack of space we omit general and interface
criteria tables.</p>
      <p>Heterogeneity is a feature-based criterion, defined as the quality of being
able to combine diverse types of knowledge. The criterion defines the following
features: (1) The tool permits the simultaneous loading of multiple ontologies;
(2) The tool supports the annotation of the same “unit” with multiple concepts.
The metric is simply based on assigning a score of one point per feature. Since the
criterion defines two features, a tool may score zero, one or two points depending
on whether it fulfills none, one or both features.</p>
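      <p>As a minimal sketch, the metric of a feature-based criterion such as Heterogeneity amounts to a sum over a feature checklist (the feature names below are illustrative, not part of the framework):</p>
      <preformat>
```python
# Score a feature-based criterion: one point per fulfilled feature.
def score_feature_criterion(features):
    """features maps each feature name to True/False."""
    return sum(1 for fulfilled in features.values() if fulfilled)

# A tool that loads multiple ontologies simultaneously but cannot
# annotate the same unit with multiple concepts scores 1 of 2 points.
heterogeneity = score_feature_criterion({
    "simultaneous_ontologies": True,
    "multiple_concepts_per_unit": False,
})
# heterogeneity is 1
```
      </preformat>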
      <p>Table 2. Domain-specific criteria (Name; Short Definition; Range; Metric):
Association; The way an annotation is associated with the annotated resource; [0,2]; 1 point per feature: (1) Tool supports external annotations. (2) Tool supports internal annotations.
Flexibility; The quality of being adaptable or variable; [0,3]; 1 point per feature: (1) It is possible to delete previously created annotations. (2) There is a component for editing ontologies. (3) It is possible to define a custom namespace to be used with the annotations.
Integrity; The capacity of ensuring the soundness and integrity of an annotation; [0,3]; 1 point per feature: (1) Tool verifies the domain and target constraints of a relation. (2) Tool certifies the correctness of an association over time. (3) Tool verifies the existence of an association.
Scope; The scope of an annotation corresponds to the parts of a resource to which the annotation applies; [0,3]; 1 point per feature: (1) Annotate the minimum unit of information. (2) Annotate any multiple of that minimum unit. (3) Annotate the resource as a whole, instead of the multiple of all units.</p>
      <p>Interoperability is a set-dependent criterion, defined as the ability to
exchange and use information among different tools. Its metric is defined as:</p>
      <p>(Fr/Fr.max + Fw/Fw.max) ∗ 1/2 ∗ 100%</p>
      <p>Consider a set of tools T. A tool t ∈ T is considered interoperable with a
tool t′ ∈ T, t′ ≠ t, if it can write in a format which t′ can read and if
it can read the formats which t′ can write. In order to be 100% interoperable
with all tools, a tool t needs to be able to read all formats which the tools
t′ ∈ T, t′ ≠ t, can write, and needs to be able to write a subset of formats F
such that any tool t′ is able to read at least one of the formats f ∈ F. Based on
this discussion, Fr.max is the cardinality of the set of all formats which can be
written by annotation tools other than the one under evaluation. Fw.max is
the cardinality of a minimum set F of formats such that any tool (other than
the one under evaluation) is able to read one of the formats f ∈ F. Fr
is the number of different annotation formats the tool can read and Fw is the
number of different formats the tool can write.</p>
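      <p>Under the definitions above, the interoperability metric can be sketched as follows (the argument names are ours; division-by-zero guards are omitted for brevity):</p>
      <preformat>
```python
# Interoperability = (Fr/Fr.max + Fw/Fw.max) * 1/2 * 100%
def interoperability(fr, fr_max, fw, fw_max):
    """fr: formats the tool reads, of the fr_max writable by other tools;
    fw: formats it writes, of the fw_max needed so that every other
    tool can read at least one of them."""
    return (fr / fr_max + fw / fw_max) * 0.5 * 100.0

# A tool reading 2 of 4 foreign formats and writing 1 of the 2
# required formats is 50% interoperable.
score = interoperability(2, 4, 1, 2)
# score is 50.0
```
      </preformat>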
      <p>Simplicity is a set-independent criterion, defined as the quality of being
simple, intuitive and easy to understand. Its metric is based on a set of
sub-criteria, where each sub-criterion is assigned a rating of 1 or 0, depending on
whether the criterion applies or not:</p>
      <p>S = Lc + Ef + Cl + Co</p>
      <p>The learning curve (Lc) designates the ease of learning the tool. If it is assigned
a value of 1, then the tool can be learned within a few hours (typically less than
5); otherwise, the tool is more complex and the help of an expert may be needed
in order to understand some of the provided features. By learning a
tool, we mean not only how to produce annotations, but also how to work with
most of the features provided. Thus, tools with many features typically have a
steeper learning curve. By efficiency (Ef) we designate the simplicity of producing
annotations, once the tool has been learned. A heuristic for measuring efficiency
is whether the tool allows one to create a valid annotation in fewer operations than
most other tools. Even so, additional details have to be taken into account:
some tools provide error-checking while others don’t. In the long run, tools
that do provide error-checking may save more time than tools that don’t,
even though the latter may create annotations in fewer steps. We acknowledge
that this criterion is slightly subjective as it is. Clearness (Cl) determines how
well a novice user gets along with simple tasks. This criterion is different from
the learning curve, because it is not concerned with understanding all the features
and functionalities of the tool. Instead, the question is whether it is easy to
locate a specific feature. For example, given a novice user the task of opening
an ontology from disk, the question is whether the interface is clear enough for
him to easily perform this task. Finally, consistency (Co) measures whether the
tool is in conformance with current standards and whether it is consistent with
itself. For example, if there are different icons to perform the same operation,
the tool is inconsistent. It also has to be taken into account whether other tools
that perform similar operations use the same icons or not. The same is true for
the hierarchy and names of menus.</p>
    </sec>
    <sec id="sec-4">
      <title>Evaluation Procedure</title>
      <p>Having defined all criteria, the next step was to devise how the evaluation of the
selected tools should be carried out. It was clear that not all tools could be
evaluated according to all criteria. For example, our precision, recall and reliability
criteria are only applicable to semi-automatic and automatic annotation tools
and cannot be applied to manual annotation tools (Ontomat). MnM, Melita,
Kim and C-Pankow were evaluated according to all criteria. A word must be
said concerning the applicability of our stability criterion to remote
services, since both Kim and C-Pankow were tested as remote services. Suppose
the invocation of a service sometimes fails. Due to lack of evidence,
we cannot draw any conclusion regarding the cause of failure: the problem may
lie at the remote service, at our internet connection, or elsewhere. Only
if a service never fails can we assume that it is stable.</p>
      <p>
        Although most criteria were of simple application, some criteria required
detailed preparation. The usability criterion was measured by means of Jakob
Nielsen’s heuristic evaluation method [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], with the help of a team of three expert
evaluators. All other criteria were measured by a single domain expert.
      </p>
      <p>Tools with support for automatic annotation were tested against a set of
manually tagged corpora. In the following sections we describe the
corpora and the ontologies which we used. We also describe an inter-annotation
experiment which was carried out.</p>
      <sec id="sec-4-1">
        <title>Corpora</title>
        <p>The first corpus was the Planet Visits corpus, courtesy of the Knowledge
Media Institute (KMI), a well-known corpus, since it comes included with some
annotation tools. We selected a set of 30 web pages, each containing one news
article. The particularity of this corpus is that all news articles are about visits,
either by someone from the KMI or by some external entity to the KMI. The documents
of this corpus may be considered unstructured text.</p>
        <p>Our second unstructured corpus was a smaller collection of news articles
extracted from the Baha’i International Community world news service.3 The
corpus consists of three larger articles, with a total count of 1582 words.</p>
        <p>Finally, a third, tabular corpus was created specifically for the evaluation of
semi-automatic annotation tools, based on the observation that the performance
of semi-automatic tools depends on the structure of documents. Since annotation
with semi-automatic tools was not successful on the previous corpora, we created
this tabular corpus hoping to obtain useful results. The corpus consists of a single
table of 100 rows4 and two columns, where the annotation task is as simple as
classifying the words of the second column with the corresponding ontological
concepts of the first column. Concerning the two sample entries in table 3, Siscog
would have to be annotated as an Organization and Germany as a Location.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Ontologies</title>
        <p>
          After choosing the corpora, specific ontologies had to be designed for the
annotation of each corpus. This task was a tedious one, due to the apparent lack of
interoperability between the tools. Among the tools with support for automated
annotation, Kim works with its own meta-ontology, C-Pankow and MnM
support ontologies formalized in RDF [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and Melita works with its own proprietary
ontology language. Additionally, Melita only supports annotation with
ontological concepts, whereas MnM only supports annotation with ontological relations;
therefore we were forced to redesign the same ontology for use with each tool.
        </p>
        <p>A simple Visits ontology was designed for the annotation of the Planet
Visits corpus. The ontology contains 11 concepts and 14 relations such as location,
date, duration, organization, visitor, visited, host, etc.</p>
        <p>For use with the tabular corpus, a very simple Conference ontology was
designed. The corpus only contains the names of persons, organizations,
locations and dates, therefore we created a simple ontology containing these
four concepts.</p>
        <p>Finally, for use with the Baha’i News corpus, the WordNet5 ontology and
Kim’s own Kimo6 ontology were used.</p>
      </sec>
      <sec id="sec-4-3">
<title>Inter-annotation Experiment</title>
        <p>
          It was clear that semi-automatic and automatic annotation tools had to be
tested against manually tagged corpora, and that the manual tagging of a corpus
would not be obvious at all. To gain more insight, we conducted a simple
inter-annotation experiment. Two knowledge engineers were asked to annotate
all named entities of a subset of 1433 words of the Planet Visits corpus with
the Visits ontology. The annotations of both experts were compared with the
help of the kappa statistic [
          <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
          ], which corrects for chance agreement. Two
important results can be extracted from this experiment. An agreement rate of
0.79 was obtained when only considering entities annotated by both experts,
showing that the classification with ontological concepts was quite well defined.
But when also considering terms annotated by only one of the experts (and left
blank by the other), we obtained a much lower rate of 0.435. This leads us to
conclude that the task of named entity recognition was quite loosely defined.
We also believe that sometimes entities may have been ignored due to lack of
relevance. We borrow the notion of relevance [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] to describe this phenomenon.
For example, it may be asked how relevant it is to annotate distinct references
to the same named entity. If the name Paul Cohen appears several
times within the same paragraph or sentence, does it have to be annotated
each time? To give yet another example, consider the annotation of a date such
as 11/08/2003. Is it relevant to say that 11 is a number, or does it suffice to
annotate it as a day? Our experience shows that the answer to this question is
apparently a complex one, mostly depending on the purpose of the annotation
task and the structure of the ontology. Our main conclusion here is that unless
the notion of relevance is clearly defined for the annotation task, the agreement
rate can be expected to degrade.
5 http://wordnet.princeton.edu/perl/webwn
6 http://www.ontotext.com/kim/kimo.rdfs
        </p>
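        <p>For concreteness, the chance-corrected agreement rate used above can be sketched as Cohen's kappa over the items both experts labeled (a toy illustration, not our actual data):</p>
        <preformat>
```python
from collections import Counter

# Cohen's kappa: observed agreement corrected for chance agreement.
def kappa(labels_a, labels_b):
    n = len(labels_a)
    # observed agreement: fraction of items labeled identically
    po = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # chance agreement from each annotator's label distribution
    ca, cb = Counter(labels_a), Counter(labels_b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (po - pe) / (1 - pe)

# Two annotators agreeing on 3 of 4 entity classifications:
k = kappa(["org", "org", "loc", "loc"],
          ["org", "loc", "loc", "loc"])
# k is 0.5 here: 0.75 observed vs 0.5 chance agreement
```
        </preformat>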
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>
        We start this section by presenting a quantitative evaluation of semi-automatic
and automatic annotation procedures, according to the traditional Precision,
Recall and F1 measures [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Even though these measures are incorporated into the
precision, recall and reliability criteria of our evaluation framework, we present
them separately due to their important role in the past and for the sake of
facilitating comparison with previous evaluations in this field. Please refer to table
2 in section 3 for the definition of these criteria. We then summarize the full
scope of our results, according to the four dimensions of the proposed evaluation
framework, and highlight some of the most important technical drawbacks and
features of the evaluated tools. At the end we present a summary of features
which may guide the newcomer in his choice.
      </p>
      <sec id="sec-5-1">
        <title>The famous Precision, Recall and F1 Measures</title>
        <p>Much care must be given to the definition of evaluation set-ups, in order to
produce reproducible and meaningful results. Table 4 defines the different
setups which were used with each tool. Please refer to sections 4.1 and 4.2 for the
description of the choice of corpora and ontologies.</p>
        <p>The first corpus to be tested was the Planet Visits corpus. To our
disappointment, semi-automatic tools, Melita and MnM, failed entirely in learning
to replicate annotations on this corpus. Even though much larger corpora could
have produced some results, we concluded that the tools were generally not
adequate for unstructured corpora (typical websites). Based on the results of the
previous experiments, we attempted a much simpler corpus: the tabular corpus.
Annotation of this corpus produced successful results. In order to have equal
bases of comparison among all tools, we also tested Kim and C-Pankow on this
corpus with the same ontology. Results are presented in table 5. C-Pankow is
excluded from this table since it only produced one annotation on this corpus.</p>
        <p>Table 4 further indicates that automatic tools Kim and C-Pankow were
successfully evaluated on both unstructured news corpora. For Kim we obtained an
F1 score of 59% (P=63.8%, R=54.8%) on the Planet Visits corpus and a score
of 64.9% (P=70.6%, R=60%) on the Baha’i corpus, whereas C-Pankow scored
16.4% (P=21.2%, R=13.4%) and 32.9% (P=36.4%, R=30%) on those corpora.</p>
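        <p>The F1 scores reported above are the harmonic mean of precision and recall; for example, Kim's Planet Visits figures can be checked as:</p>
        <preformat>
```python
# F1 as the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Kim on the Planet Visits corpus: P = 63.8%, R = 54.8%
score = f1(0.638, 0.548)
# score is about 0.59, matching the reported F1 of 59%
```
        </preformat>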
      </sec>
      <sec id="sec-5-2">
<title>The New Evaluation Framework</title>
        <p>
          According to the proposed evaluation framework, we distinguish between two
annotation-specific dimensions: procedure and metadata. Even though the
traditional measures discussed in the previous section are an integral part of the
procedure criteria, we suggest that a holistic view includes other important criteria
as well. In conformance with the procedure criteria presented earlier (Table 2
of Sect. 3), summarized results of the procedure dimension, as well as of all other
dimensions, are presented at the end of this section (Fig. 1). We can see that
both semi-automatic tools produce similar results, since they are based on the
same information extraction engine. With regard to automatic tools, scores in
Figure 1 are based on the annotation of the Baha’i corpus. When comparing
the results of Kim with those of the semi-automatic tools, the future of the latter
is clearly called into question, given the simplicity of the former.
Although C-Pankow’s results fall slightly behind, it promises to work well on
domain-specific corpora [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] where Kim’s upper ontology is of little use.
        </p>
        <p>
          In the arena of metadata, Ontomat clearly scores best. Interestingly, the
semi-automatic tools Melita and MnM obtain the same medium scores, whereas the
automatic tools Kim and C-Pankow fall slightly behind. The equal scores of the
semi-automatic tools reflect the fact that they have similar metadata
components and that they both author annotations in a similar mark-up based on
XML [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Even so, both have an interoperability of 0: the XML language lacks
formal semantics, therefore the semantics of the annotations are those intended
by the authors of the tools. Since both tools follow different naming conventions
for the creation of the XML tags (annotations), there is no formal way of
establishing that the annotations of one tool are the same as the annotations of the
other. For example, considering an ontology which has the concept House as a
subclass of Thing, whenever an entity is annotated as a house, Melita annotates
the entity with the tag &lt;house&gt;, whereas MnM creates the tag &lt;thouse&gt;. The
tools follow different naming conventions, and due to the lack of semantics of the
XML language there is no way of establishing that both tags refer to the same
concept.
        </p>
        </p>
        <p>
          We point out that Ontomat’s annotations are of high quality, formalized as
external annotations in OWL [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Additionally, they may be embedded into web
pages, simulating internal annotations. Although not an essential requirement
of an annotation tool, Ontomat’s ontology creation component is also extremely
handy. With regard to automatic annotation tools, their low scores can be
justified by the fact that they are simple interfaces to remote annotation services,
and as such, they don’t have a component for editing annotations or ontologies.
Low scores in this dimension are mostly due to features that could be present in
the tools but are not present in their current state. While such scores might be
discouraging to newcomers, experts and researchers may interpret them as new
opportunities for development.
        </p>
        <p>Kim distinguishes itself as the tool which scores best in the scope criterion.
This is due to the fact that it can mark-up all occurrences of the same entity, as
belonging to the same instance of the ontology.</p>
        <p>Likewise, tools were evaluated according to the domain-independent
dimensions (Fig. 1). The final evaluation score of each dimension is given by the
arithmetic mean of the normalized scores of its criteria. We recall that the scores
have no absolute meaning, but rather a relative one. Therefore it is only
meaningful to compare tools with similar functionality. Concerning full-featured manual
tools, the ideal tool should aim to combine the simplicity of Melita’s interface
with the annotation-related features of Ontomat. Interestingly, all tools scored
medium with respect to our general criteria. Melita has very good
documentation, but scores low in scalability. Conversely, Ontomat is scalable but
has very little documentation (only a simple tutorial is available). It also has
several stability issues, which is not surprising since the tool is officially marked
as being in alpha development stage. MnM, on the other hand, scores medium
in all general criteria.</p>
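        <p>The dimension scores described above can be sketched as follows (the per-criterion scores here are hypothetical; the real data is in our result tables):</p>
        <preformat>
```python
# Dimension score: arithmetic mean of normalized criterion scores,
# expressed as a percentage.
def dimension_score(criteria):
    """criteria: list of (score, max_score) pairs, one per criterion."""
    normalized = [100.0 * score / maximum for score, maximum in criteria]
    return sum(normalized) / len(normalized)

# A tool scoring 1/2, 2/3 and 3/3 on three criteria of a dimension:
d = dimension_score([(1, 2), (2, 3), (3, 3)])
# d is about 72.2
```
        </preformat>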
        <p>Fig. 1. Normalized scores (0-100%) of Melita, MnM, Ontomat, Kim and C-Pankow along the interface, metadata, procedure (manual), procedure (automatic) and general dimensions.</p>
        <p>Concerning the speed of the annotation procedures, it was not possible to apply
our speed criterion, because we had no way of finding out which
named entities each tool analyzed for annotation. Therefore we have no real
indicator of the speed of the tools. Even so, we can affirm that semi-automatic
procedures only took several seconds of processing. Kim usually took 4 seconds
per page (of approximately 150 words), whereas C-Pankow took several minutes
for the same page.</p>
        <p>Finally, a brief reference for newcomers is presented in Table 6. Due to lack of
space, only some of the most important aspects of our framework are included.
With regard to a feature, a classification of ++ indicates excellence, a single +
indicates the feature exists and is fine, +- stands for medium to poor, and - stands
for absent. In case a feature is not applicable to a given tool, this is indicated
with na.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>In this paper we present an evaluation framework for ontological annotation
tools. A set of 5 tools was evaluated using this framework. One of the main
problems faced by all annotation tools is lack of interoperability. Given that
one tool alone is probably not enough to perform perfect annotation of the web,
and that one should probably aim at a combined multi-tool approach, this is
a severe problem. Regarding manual annotation tools, our main conclusion is
that they do not provide a good enough compromise: some are too focused on
the interface, while others focus too much on the metadata. They lack a good
compromise between both issues. Regarding semi-automatic tools, the main conclusion
is that they only work with structured corpora, which is not the most common case on
the current web. Finally, regarding automatic annotation tools, current results are
rather impressive, but it is difficult to assess how they would scale. For instance,
it is difficult to foresee how Kim would perform on domain-specific web pages.
Therefore, to our understanding, they still have a long way to go before they
can be fully used to annotate the current web. The long-term goal of the work
reported in this paper is to contribute to the automatic annotation field. Our
plans include trying to improve one of the current tools.</p>
      <p>Acknowledgments Our special thanks go to Bruno Grilo and Rudi Araujo (inter-annotation
experiment) and to Claudio Gil and Vitor Oliveira (expert usability analysis). We extend our
gratitude to Philipp Cimiano for his constant availability and help with C-Pankow. We would also like to
thank Aditya Kalyanpur, Borislav Popov, Enrico Motta, Fabio Ciravegna, Jeff Heflin and Siegfried
Handschuh for their replies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. A. Gómez-Pérez, M. Fernández-López, and O. Corcho, Ontological Engineering, Springer-Verlag London Limited, 2004.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. B. Popov, et al., Kim - semantic annotation platform, in ISWC, pp. 834-849, 2003.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><given-names>F.</given-names> <surname>Ciravegna</surname></string-name>, et al.,
          <article-title>User-system cooperation in document annotation based on information extraction</article-title>, in
          <conf-name>EKAW</conf-name>, pp.
          <fpage>122</fpage>-<lpage>137</lpage>,
          <year>2002</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><given-names>M.</given-names> <surname>Vargas-Vera</surname></string-name>, et al.,
          <article-title>MnM: Ontology driven semi-automatic and automatic support for semantic markup</article-title>, in
          <conf-name>EKAW</conf-name>, pp.
          <fpage>379</fpage>-<lpage>391</lpage>,
          <year>2002</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><given-names>S.</given-names> <surname>Handschuh</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Staab</surname></string-name>, and
          <string-name><given-names>F.</given-names> <surname>Ciravegna</surname></string-name>,
          <article-title>S-CREAM - semi-automatic creation of metadata</article-title>, in
          <conf-name>EKAW</conf-name>, pp.
          <fpage>358</fpage>-<lpage>372</lpage>,
          <year>2002</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><given-names>P.</given-names> <surname>Cimiano</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Ladwig</surname></string-name>, and
          <string-name><given-names>S.</given-names> <surname>Staab</surname></string-name>,
          <article-title>Gimme' the context: context-driven automatic semantic annotation with C-PANKOW</article-title>, in
          <conf-name>WWW</conf-name>, pp.
          <fpage>332</fpage>-<lpage>341</lpage>,
          <year>2005</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><given-names>F.</given-names> <surname>Ciravegna</surname></string-name>, et al.,
          <article-title>Amilcare: adaptive information extraction for document annotation</article-title>, in
          <conf-name>SIGIR</conf-name>, pp.
          <fpage>367</fpage>-<lpage>368</lpage>,
          <year>2002</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><given-names>J.</given-names> <surname>Nielsen</surname></string-name>,
          <article-title>Finding usability problems through heuristic evaluation</article-title>, in
          <conf-name>CHI</conf-name>, pp.
          <fpage>373</fpage>-<lpage>380</lpage>,
          <year>1992</year>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. RDF specification, http://www.w3.org/TR/RDF/.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><given-names>J.</given-names> <surname>Cohen</surname></string-name>,
          <article-title>A coefficient of agreement for nominal scales</article-title>.
          <source>Educational and Psychological Measurement</source>, volume
          <volume>20</volume>, no.
          <issue>1</issue>, pp.
          <fpage>37</fpage>-<lpage>46</lpage>,
          <year>1960</year>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><given-names>J.</given-names> <surname>Carletta</surname></string-name>,
          <article-title>Assessing agreement on classification tasks: The kappa statistic</article-title>.
          <source>Computational Linguistics</source>, volume
          <volume>22</volume>, no.
          <issue>2</issue>, pp.
          <fpage>249</fpage>-<lpage>254</lpage>,
          <year>1996</year>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name><given-names>C. J.</given-names> <surname>van Rijsbergen</surname></string-name>,
          <source>Information Retrieval</source>, Butterworth,
          <year>1979</year>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. XML specification, http://www.w3.org/XML/.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. OWL specification, http://www.w3.org/TR/owl-features/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>