<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>S-CREAM - Semi-automatic CREAtion of Metadata</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Steffen Staab</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Richly interlinked, machine-understandable data constitute the basis for the Semantic Web. We provide a framework, S-CREAM, that allows for the creation of metadata and is trainable for a specific domain. Annotating web documents is one of the major techniques for creating metadata on the web. OntoMat, the implementation of S-CREAM, now supports the semi-automatic annotation of web pages. This semi-automatic annotation is based on the information extraction component Amilcare. With the help of Amilcare, OntoMat extracts knowledge structures from web pages through the use of knowledge extraction rules. These rules are the result of a learning cycle based on already annotated pages.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Semantic Web builds on metadata describing the contents of
Web pages. In particular, the Semantic Web requires relational
metadata, i.e. metadata that describe how resource descriptions instantiate
class definitions and how they are semantically interlinked by
properties. To support the construction of relational metadata, we have
provided an annotation [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and authoring [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] framework (CREAM - manually CREAting Metadata) and a tool (OntoMat) that
implements this framework. Nevertheless, providing plenty of relational
metadata by annotation, i.e. conceptual mark-up of text passages,
remained a laborious task.
      </p>
      <p>
        Though there existed the high-level idea that wrappers and
information extraction components could be used to facilitate the work
[
        <xref ref-type="bibr" rid="ref15 ref8">8, 15</xref>
        ], a full-fledged integration that dealt with all the
conceptual difficulties was still lacking. Therefore, we have developed
S-CREAM (Semi-automatic CREAtion of Metadata), an annotation
framework that integrates a learnable information extraction
component (viz. Amilcare [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]).
      </p>
      <p>Amilcare is a system that learns information extraction rules from
manually marked-up input. S-CREAM aligns conceptual markup,
which defines relational metadata (such as provided through
OntoMat), with semantic and indicative tagging (such as produced by
Amilcare).</p>
      <p>There are two major types of problems that we had to solve for this
purpose:
1. When comparing the desired relational metadata from manual
markup and the semantic tagging provided by information
extraction systems, one recognizes that the output of this type of system
is underspecified for the purpose of the Semantic Web. In
particular, the nesting of relationships between different types of
concept instances is undefined and, hence, more comprehensive graph
structures may not be produced (further elaboration in Section 4).
In order to overcome this problem, we introduce a new processing
component, viz. a lightweight module for discourse representation
(Section 5).</p>
    </sec>
    <sec id="sec-2">
      <title>CREAM/OntoMat</title>
      <p>CREAM is an annotation and authoring framework suited for the
easy and comfortable creation of relational metadata. OntoMat is its
concrete implementation. Before we sketch some of the capabilities
of CREAM/OntoMat, we first describe its assumptions on its output
representation and some terminology we use subsequently.</p>
    </sec>
    <sec id="sec-3">
      <title>Relational Metadata</title>
      <p>
        We elaborate the terminology here because many of the terms that are
used with regard to metadata creation tools carry several, ambiguous
connotations that imply conceptually important differences:
Ontology: An ontology is a formal, explicit specification of a
shared conceptualization of a domain of interest [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In our case it
is constituted by statements expressing definitions of DAML+OIL
classes and properties [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>Annotations: An annotation in our context is a set of
instantiations attached to an HTML document. We distinguish (i)
instantiations of DAML+OIL classes, (ii) instantiated properties from one
class instance to a datatype instance, henceforth called attribute
instance (of the class instance), and (iii) instantiated properties
from one class instance to another class instance, henceforth
called relationship instance.</p>
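      <p>To make the three kinds of instantiations concrete, the following
sketch represents them as plain subject-predicate-object triples (the URIs,
property names and values here are illustrative, not actual OntoMat
output):</p>

```python
# Sketch: the three kinds of annotation instantiations as plain triples.
# All URIs and names are illustrative, not from the actual OntoMat output.

hotel = "urn:ex:hotel-1"   # a class instance, identified by a unique URI
city = "urn:ex:city-1"     # another class instance

annotation = [
    # (i) instantiations of classes
    (hotel, "rdf:type", "Hotel"),
    (city, "rdf:type", "City"),
    # (ii) attribute instances: property from a class instance to a datatype value
    (hotel, "label", "Zwei Linden"),
    (city, "label", "Dobbertin"),
    # (iii) relationship instance: property from one class instance to another
    (hotel, "locatedAt", city),
]

# Relational metadata = annotations containing at least one relationship
# instance, i.e. a triple whose object is itself a class instance.
instances = {s for (s, p, o) in annotation if p == "rdf:type"}
relationship_instances = [t for t in annotation if t[2] in instances]
print(relationship_instances)  # [('urn:ex:hotel-1', 'locatedAt', 'urn:ex:city-1')]
```

      <p>In this representation, relational metadata is simply an annotation
that contains at least one triple whose object is itself a class
instance.</p>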
      <p>Class instances have unique URIs, e.g. like
'urn:rdf:936694d5ca907974ea16565de20c997a-0' (in the OntoMat
implementation we create the URIs with the createUniqueResource
method of the RDF-API). They frequently come with attribute instances, such as a
human-readable label like `Dobbertin'.</p>
      <p>Metadata: Metadata are data about data. In our context the
annotations are metadata about the HTML documents.</p>
      <p>Relational Metadata: We use the term relational metadata to
denote the annotations that contain relationship instances.</p>
      <p>Often, the term annotation is used to mean something like a
private or shared note, comment or Dublin Core metadata. This
alternative meaning of annotation may be emulated in our
approach by modelling these notes with attribute instances. For
instance, a comment note "I like this paper" would be related to the
URL of the paper via an attribute instance `hasComment'.
In contrast, relational metadata also contain statements like: "The
hotel Zwei Linden is located in the city Dobbertin.", i.e.
relational metadata contain relationships between class instances
rather than only textual notes.</p>
      <p>(Currently only available in German at
http://ontobroker.semanticweb.org/ontos/compontos/tourism I1.daml.)</p>
    </sec>
    <sec id="sec-4">
      <title>Modes of Interaction</title>
      <p>
        The objective of CREAM is to allow for the easy generation of
target representations such as just illustrated. This objective should be
achieved irrespective of the mode of interaction. In the latest version
of CREAM [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] there existed three major modes:
1. Annotation by typing statements involves working almost
exclusively with the ontology browser and fact templates.
2. Annotation by markup involves reuse of data from the document
editor in the ontology browser by first marking document parts and
drag'n'dropping them onto the ontology.
3. Annotation by authoring web pages involves the reuse of data
from the ontology and fact browser in the document editor by
drag'n'drop.
      </p>
      <p>OntoMat usually embeds the resulting annotation into the HTML
document, but it can also be stored in a separate file or database.</p>
    </sec>
    <sec id="sec-5">
      <title>Amilcare</title>
      <p>Amilcare is a tool for adaptive Information Extraction (IE) from text
designed for supporting active annotation of documents for
Knowledge Management (KM). It performs IE by enriching texts with
XML annotations, i.e. the system marks the extracted information
with XML annotations. The only knowledge required for porting
Amilcare to new applications or domains is the ability to manually
annotate the information to be extracted in a training corpus. No
knowledge of Human Language technology is necessary. Adaptation
starts with the definition of a tagset for annotation. Then users have
to manually annotate a corpus for training the learner. As will later be
explained in detail, OntoMat may also be used as the annotation
interface to annotate texts in a user-friendly manner. OntoMat provides
user annotations as XML tags to train the learner. Amilcare's learner
induces rules that are able to reproduce the text annotation.</p>
      <p>Amilcare can work in two modes: training, used to adapt to a new
application, and extraction, used to actually annotate texts.</p>
      <p>
        In both modes, Amilcare first of all preprocesses texts using
Annie, the shallow IE system included in the Gate package ([
        <xref ref-type="bibr" rid="ref22">22</xref>
        ],
www.gate.ac.uk). Annie performs text tokenization (segmenting
texts into words), sentence splitting (identifying sentences), part of
speech tagging (lexical disambiguation), gazetteer lookup
(dictionary lookup) and named entity recognition (recognition of people
and organization names, dates, etc.).
      </p>
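      <p>The kind of shallow preprocessing these steps describe can be
sketched with a toy pipeline (an illustration of the listed steps only, not
the actual Gate/Annie API; the gazetteer entries are invented):</p>

```python
import re

# Toy sketch of shallow preprocessing steps of the kind Annie performs
# (tokenization, sentence splitting, gazetteer lookup). Illustration only,
# not the actual Gate/Annie API; the gazetteer entries are invented.

GAZETTEER = {"Dobbertin": "city", "EUR": "currency"}

def tokenize(text):
    """Segment a sentence into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def split_sentences(text):
    """Identify sentences by splitting after sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def gazetteer_lookup(tokens):
    """Annotate each token with its gazetteer class, if any."""
    return [(tok, GAZETTEER.get(tok)) for tok in tokens]

text = "The hotel lies in Dobbertin. Prices are given in EUR."
sentences = split_sentences(text)
tokens = tokenize(sentences[0])
lookup = gazetteer_lookup(tokens)
print(sentences)
print(lookup)
```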
      <p>
        When operating in training mode, Amilcare induces rules for
information extraction. The learner is based on (LP)2, a covering
algorithm for supervised learning of IE rules based on Lazy-NLP [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This is a wrapper induction methodology [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] that, unlike other
wrapper induction approaches, uses linguistic information in the rule
generalization process. The learner starts inducing wrapper-like rules
that make no use of linguistic information, where rules are sets of
conjunctive conditions on adjacent words. Then the linguistic
information provided by Annie is used in order to generalise rules:
conditions on words are substituted with conditions on the linguistic
information (e.g. condition matching either the lexical category, or the
class provided by the gazetteer, etc. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). All the generalizations are
tested in parallel by using a variant of the AQ algorithm [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and the
best k generalizations are kept for IE. The idea is that the
linguistic-based generalization is used only when the use of NLP information is
reliable or effective. The measure of reliability here is not linguistic
correctness (immeasurable by incompetent users), but effectiveness
in extracting information using linguistic information as opposed to
using shallower approaches. Lazy NLP-based learners learn which is
the best strategy for each information/context separately. For
example they may decide that using the result of a part of speech tagger
is the best strategy for recognising the location in holiday
advertisements, but not for spotting the hotel address. This strategy is quite
effective for analysing documents with mixed genres, quite a common
situation in web documents [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The learner induces two types of rules: tagging rules and
correction rules. A tagging rule is composed of a left hand side,
containing a pattern of conditions on a connected sequence of words, and a
right hand side that is an action inserting an XML tag in the texts.
Each rule inserts a single XML tag, e.g. &lt;/hotel&gt;. This makes the
approach different from many adaptive IE algorithms, whose rules
recognize whole pieces of information (i.e. they insert both &lt;hotel&gt;
and &lt;/hotel&gt;), or even multi slots. Correction rules shift misplaced
annotations (inserted by tagging rules) to the correct position. They
are learnt from the mistakes made in attempting to re-annotate the
training corpus using the induced tagging rules. Correction rules are
identical to tagging rules, but (1) their patterns also match the tags
inserted by the tagging rules and (2) their actions shift misplaced tags
rather than adding new ones. The output of the training phase is a
collection of rules for IE that are associated with the specific scenario.</p>
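      <p>The following toy sketch illustrates this rule behaviour (the rule
format, token lists and helper functions are invented for illustration;
Amilcare's actual rule language is far richer):</p>

```python
# Toy sketch of single-tag rules (invented format, not Amilcare's own).
# A tagging rule matches a pattern of conditions on adjacent words and
# inserts one XML tag; a correction rule shifts a misplaced tag.

def apply_tagging_rule(tokens, pattern, offset, tag):
    """Insert `tag` at `offset` inside the first window matching `pattern`."""
    for i in range(len(tokens) - len(pattern) + 1):
        if all(cond(w) for cond, w in zip(pattern, tokens[i:i + len(pattern)])):
            return tokens[:i + offset] + [tag] + tokens[i + offset:]
    return tokens

def apply_correction_rule(tokens, tag, shift):
    """Move a misplaced `tag` by `shift` positions instead of adding one."""
    i = tokens.index(tag)
    rest = tokens[:i] + tokens[i + 1:]
    return rest[:i + shift] + [tag] + rest[i + shift:]

tokens = ["Hotel", "Zwei", "Linden", "in", "Dobbertin"]

# Tagging rule: after the words "Zwei Linden", insert the single tag </hotel>.
rule = [lambda w: w == "Zwei", lambda w: w == "Linden"]
tagged = apply_tagging_rule(tokens, rule, 2, "</hotel>")
print(tagged)  # ['Hotel', 'Zwei', 'Linden', '</hotel>', 'in', 'Dobbertin']

# Correction rule: the tag was placed one token too early; shift it right.
corrected = apply_correction_rule(["<city>", "in", "Dobbertin"], "<city>", 1)
print(corrected)  # ['in', '<city>', 'Dobbertin']
```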
      <p>
        When working in extraction mode, Amilcare receives as input
a (collection of) text(s) with the associated scenario (including the
rules induced during the training phase). It preprocesses the text(s)
by using Annie and then it applies its rules and returns the original
text with the added annotations. The Gate annotation schema is used
for annotation [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>
        Amilcare is designed to accommodate the needs of different user
types. While naive users can build new applications without
delving into the complexity of Human Language Technology, IE experts
are provided with a number of facilities for tuning the final
application. Induced rules can be inspected, monitored and edited to
obtain some additional accuracy, if needed. The interface also allows
balancing precision (P) and recall (R). The system is run on an
annotated unseen corpus and users are presented with statistics on
accuracy, together with details on correct matches and mistakes (using
the MUC scorer [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and an internal tool). Retuning the P&amp;R balance
does not generally require major retraining. Facilities for inspecting
the effect of different P&amp;R balances are provided. Although the
current interface for balancing P&amp;R is designed for IE experts, we have
plans for also enabling naive users [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>Synthesizing S-CREAM</title>
      <p>In order to synthesize S-CREAM out of the existing frameworks
CREAM and Amilcare, we consider their core processes in terms
of input and output, as well as the process of the yet undefined
S-CREAM. Figure 2 surveys the three processes.</p>
      <p>
        The first process is indicated by a circled M. It is manual
annotation and authoring of metadata, which turns a document into
relational metadata that corresponds to the given ontology (as sketched
in Section 2 and described in detail in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]). For instance, an annotator
may use OntoMat to describe that on the homepage of hotel Zwei
Linden (cf. Figure 1) the relationships listed in Table 1(a) show up.
      </p>
      <p>
        The second process is indicated by a circled A1. It is information
extraction, e.g. provided by Amilcare [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which digests a document
and produces either an XML-tagged document or a list of XML-tagged
text snippets (cf. Table 1(b)).
      </p>
      <p>The obvious questions that come up at this point are: Is the result
of Table 1(b) equivalent to the one in Table 1(a)? How can Table 1(b)
be turned into the result of Table 1(a)? The latter is a requirement for
the Semantic Web.</p>
      <p>The Semantic Web answer to this is: The difference between
A1
Document IE
tagged</p>
      <sec id="sec-6-1">
        <title>Output</title>
        <p>A2
DR</p>
      </sec>
      <sec id="sec-6-2">
        <title>Hotel</title>
      </sec>
      <sec id="sec-6-3">
        <title>City</title>
      </sec>
      <sec id="sec-6-4">
        <title>Hotel</title>
      </sec>
      <sec id="sec-6-5">
        <title>City</title>
        <p>A3</p>
      </sec>
      <sec id="sec-6-6">
        <title>Thing</title>
        <p>region</p>
      </sec>
      <sec id="sec-6-7">
        <title>City</title>
        <p>accommodation
located_at
Table 1(a) and Table 1(b) is analogous to the difference between an
RDF structure and a very particular serialization of data in XML.
This means that assuming a very particular serialization of
information on Web pages, the Amilcare tags can be speci ed so precisely5
that indeed Table 1(b) can be rather easily mapped into Table 1(a).
The only requirement may be a very precise speci cation of tags, e.g.
43,46 may need to be tagged as
lowerprice-of-doublebedroomof-hotel 43,46 /lowerprice-of-doubleroom-of-hotel in order to
cope with its relation to a doubleroom of a hotel.</p>
        <p>The Natural Language Analysis answer to the above questions
is: Learnable information extraction approaches like Amilcare do not
have an explicit discourse model for relating tagged entities  at
least for now. Their implicit discourse model is that each tag
corresponds to a place in a template6 and every document (or document
analogon) corresponds to exactly one template. This is ne as long as
the discourse structures in the text are simple enough to be mapped
into the template and from the template into the target RDF structure.</p>
        <p>In practice, however, the assumption that the underlying graph
structures/
discourse structures are quite similar, often does not hold. Then the
direct mapping from XML tagged output to target RDF structure
becomes awkward and dif cult to do.</p>
        <p>
          The third process given in Figure 2 is indicated by the composition
of A1, A2 and A3. It bridges from the tagged output of the
information extraction system to the target graph structures via an explicit
discourse representation. Our discourse representation is based on a
very lightweight version of Centering [
          <xref ref-type="bibr" rid="ref12 ref24">12, 24</xref>
          ] and explained in the
next section.
5
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Discourse Representation (DR)</title>
      <p>The principal task of discourse representation is to describe
coherence between different sentences. The core idea is that during the
interpretation of a text (or, more generally, a document), there is
always a logical description (e.g., an RDF(S) graph) of the content that
has been read so far. The current sentence updates this logical
description by:
1. Introducing new discourse referents: I.e. introducing new
entities. E.g., finding the term `Hotel &amp; Inn Zwei Linden' to denote
a new object.</p>
      <p>(We abstract here from the problem of correctly tagging a piece of
text.)</p>
      <sec id="sec-7-2">
        <title>-</title>
        <p>(A template is like a single tuple in an unnormalized relational
database table, where all or several entries may have null values.)</p>
        <p>2. Resolving anaphora: I.e. describing denotational equivalence
between different entities in the text. E.g. `Hotel &amp; Inn Zwei
Linden' and `Country inn' refer to the same object.
3. Establishing new logical relationships: I.e. relating the two
objects referred to by `Hotel &amp; Inn Zwei Linden' and `Dobbertin'
via LOCATEDAT.</p>
        <p>The problem with information extraction output is that it is not
clear what constitutes a new discourse entity. Though information
extraction may provide some typing (e.g. &lt;city&gt;Dobbertin&lt;/city&gt;),
it does not describe whether this constitutes an attribute value (of
another entity) or an entity of its own. Neither do information
extraction systems like Amilcare treat coherence between different pieces
of tagged text.</p>
        <p>
          Grosz &amp; Sidner [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] devised centering as a theory of text
structure that separates text into segments that are coherent to each other.
The principal idea of the centering model is to express fixed
constraints as well as soft rules which guide the reference resolution
process. The fixed constraints denote what objects are available at all
for resolving anaphora and establishing new logical inter-sentential
relationships, while soft rules give a preference ordering to these
possible antecedents. The main data structure of the centering model is
a list of forward-looking centers for each utterance. The
forward-looking centers constitute a ranked list of what is
available and what is preferred for resolving anaphora and for
establishing new logical relationships with previous sentences.
        </p>
        <p>The centering model allows for relating a given entity in the
current utterance to one of the forward-looking centers of the
preceding utterance. For instance, when reading `The chef of the
restaurant' in Figure 1, the centering model allows relationships with
`Country inn', but not with `Dobbertin'.</p>
        <p>The drawback of the centering model is that, first, it has only been
devised for full text and not for semi-structured text such as appears
in Figure 1 and, second, it often needs more syntactic information
than shallow information extraction can provide.</p>
        <p>Therefore, we use only an extremely lightweight, degraded
version of centering, where we formulate the rules on an ad hoc basis as
needed by the annotation task. The underlying ideas of the degrading
are that S-CREAM is intended to work in restricted, though
adaptable, domains. It is not even necessary to have a complete model,
because we analyse only a very small part of the text. For instance,
we analyse only the part about hotels with rooms, prices, addresses
and hotel facilities. Note that thereby, hotel facilities are found in full
texts rather than tables and not every type of hotel facility is known
beforehand. Further information that could be included is, e.g., adjacency
information, etc. Thus, one may produce Table 1(a) out of the discourse
representation from a numbered Table 2(a).</p>
        <p>The strategy that we follow here is to make simple things
simple and complex tasks possible. The experienced user will be able to
handcraft logical rules in order to define the discourse model to his
needs. The standard user will only exploit the simple template
strategy. When the resulting graph structures are simple enough to allow
for the latter strategy and a simple mapping, the mapping can also be
defined by directly aligning relevant concepts and relations by drag
and drop, while in the general case one must write logical rules.</p>
        <p>
          We specify the discourse model by logical rules, the effects of
which we illustrate in the following paragraphs. Thereby, we use the
same inferencing mechanisms that we have already exploited for
supporting annotation [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], viz. Ontobroker [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>As our baseline model, we assume the single template strategy,
viz. only one type of tag, e.g. &lt;hotel&gt;, is determined to really
introduce a new discourse referent. Every other pair of tag name and
tag value is attached to this entity as an attribute filled by the tag
value. E.g. Zwei Linden is recognized as an instance of Hotel, and
every other entity (like Dobbertin, etc.) is attached to this instance,
resulting in a very shallow discourse representation by logical facts
illustrated in Table 2(a). This is probably the shallowest discourse
representation possible at all, because it does not include ordering
constraints or other soft constraints. However, it is already adequate
to map some of the relations in the discourse namespace (dr:) to
relations in the target space, thus resulting in Table 2(b). However,
given this restricted tag set, not every relation can be detected.</p>
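        <p>A minimal sketch of this baseline strategy, mirroring the facts of
Table 2(a) (the tag names and fact format are illustrative):</p>

```python
# Sketch of the single-template baseline: one designated tag ("hotel")
# introduces the discourse referent; every other tag/value pair attaches
# to it as a dr: attribute (cf. Table 2(a)). Tag names are illustrative.

extracted = [
    ("hotel", "Zwei Linden"),
    ("city", "Dobbertin"),
    ("singleroom", "single room"),
    ("price", "25,66"),
    ("currency", "EUR"),
]

def shallow_discourse(tags, referent_tag="hotel"):
    """Build a shallow discourse representation from flat IE output."""
    referent = next(v for t, v in tags if t == referent_tag)
    facts = [(referent, "dr:instof", referent_tag.capitalize())]
    for t, v in tags:
        if t != referent_tag:
            facts.append((referent, "dr:" + t, v))
    return facts

facts = shallow_discourse(extracted)
for fact in facts:
    print(fact)
```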
        <p>For more complex models, we may also include ordering
information (e.g. simply by augmenting the discourse representation tuples
given in Table 2 by numbers; this may be modelled as 4-arity
predicates in F-Logic used by Ontobroker) and a set of rules that map the
discourse representation into the target structure, integrating:
rules to only attach instances where they are allowed to become
attached (e.g., prices are only attached where they are allowed);
rules to attach tag values to the nearest preceding, conceptually
possible entity (thus, prices for single and double rooms may be
distinguished without further ado);
rules to create a new complex object when two simple ones are
adjacent, e.g., to create a rate when it finds adjacent numbers and
currencies.</p>
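        <p>For instance, the rule that attaches tag values to the nearest
preceding, conceptually possible entity might be sketched as follows (toy
data and types; the actual rules are written in F-Logic for
Ontobroker):</p>

```python
# Toy sketch of the "attach to the nearest preceding, conceptually
# possible entity" rule: prices attach to the most recent room tag seen
# before them. The ordered tag stream and type sets are illustrative.

ordered_tags = [
    (1, "singleroom", "single room"),
    (2, "price", "25,66"),
    (3, "doubleroom", "double room"),
    (4, "price", "43,46"),
    (5, "price", "46,02"),
]

ROOM_TAGS = {"singleroom", "doubleroom"}

def attach_prices(tags):
    """Attach each price to the nearest preceding room entity."""
    facts, current_room = [], None
    for _, tag, value in tags:
        if tag in ROOM_TAGS:
            current_room = value
        elif tag == "price" and current_room is not None:
            facts.append((current_room, "hasPrice", value))
    return facts

price_facts = attach_prices(ordered_tags)
print(price_facts)
```

        <p>Because attachment follows document order, prices for single and
double rooms are distinguished without further ado.</p>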
      </sec>
      <sec id="sec-7-3">
        <title>Table 2: (a) Discourse Representation and (b) Target Graph Structure</title>
        <p>(a) Discourse Representation:
Zwei Linden DR:INSTOF Hotel;
Zwei Linden DR:CITY Dobbertin;
Zwei Linden DR:SINGLE_ROOM single room;
Zwei Linden DR:PRICE 25,66;
Zwei Linden DR:CURRENCY EUR;
Zwei Linden DR:DOUBLE_ROOM double room;
Zwei Linden DR:PRICE 43,46;
Zwei Linden DR:PRICE 46,02;
Zwei Linden DR:CURRENCY EUR.</p>
        <p>(b) Target Graph Structure:
Zwei Linden INSTOF Hotel;
Zwei Linden LOCATED_AT Dobbertin;
Dobbertin INSTOF City;
Zwei Linden HAS_ROOM single room1;
single room1 INSTOF Single Room;
Zwei Linden HAS_ROOM double room1;
double room1 INSTOF Double Room.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Usage scenario</title>
      <p>This section describes a usage scenario. The first step is the project
definition. A domain ontology can be the basis for the annotation
of different types of documents. Likewise, a certain kind of
document can be annotated in reference to different ontologies.
Therefore a project defines the combination of a domain ontology (e.g.
about tourism) with a certain text type (e.g. hotel homepages).
Further, the user has to define which part of the ontology is relevant for
the learning task, e.g. which attributes of the several concepts will
be used for tagging the corpus. The mapping of the ontology to the
Amilcare tags works as follows:
concepts: concepts are mapped by the name of the concept, e.g.
the concept with the name Hotel results in a &lt;hotel&gt; tag.
inheritance: the concepts of the ontology represent a hierarchical
structure. To emulate the different levels of conceptualization,
OntoMat allows mapping a concept to multiple tags, e.g. the concept
Hotel to &lt;company&gt;, &lt;accommodation&gt;, and &lt;hotel&gt;.
attributes: The mapping of attributes to tags is a tradeoff between
a specific and a general naming. The specific naming eases the
mapping to the ontology concepts, but at the same time it results
in more complex extraction rules. These rules are less general and
less robust. For example, a specific naming of the attribute phone
would result in tags like &lt;hotel_phone&gt;, &lt;room_phone&gt;, and
&lt;person_phone&gt; in comparison to the general tag &lt;phone&gt;.
Therefore the user has to decide for every attribute the adequate
accuracy of the naming, because it influences the learning results.</p>
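      <p>Such a concept-to-tag mapping might be sketched as follows (the
mapping tables and function are hypothetical; the real mapping is
configured interactively in OntoMat):</p>

```python
# Hypothetical sketch of the ontology-to-tag mapping set up during
# project definition. Names are illustrative; the real mapping is
# configured interactively in OntoMat.

concept_to_tags = {
    # inheritance: one concept may map to several tags, one per level
    "Hotel": ["company", "accommodation", "hotel"],
    "City": ["city"],
}

# attribute naming trade-off: specific vs. general tag names
attribute_to_tag_specific = {("Hotel", "phone"): "hotel_phone",
                             ("Person", "phone"): "person_phone"}
attribute_to_tag_general = {("Hotel", "phone"): "phone",
                            ("Person", "phone"): "phone"}

def tags_for_concept(concept):
    """Tags used to annotate text about the given concept."""
    return concept_to_tags.get(concept, [concept.lower()])

print(tags_for_concept("Hotel"))  # ['company', 'accommodation', 'hotel']

# General naming yields one rule set for all phones (more robust rules);
# specific naming eases the mapping back to the ontology.
print(len(set(attribute_to_tag_general.values())))   # 1
print(len(set(attribute_to_tag_specific.values())))  # 2
```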
      <p>After the definition of the project parameters one needs a corpus,
a set of a certain type of documents, e.g. hotel homepages.</p>
      <p>If enough annotated documents already exist on the web, the
user can perform a crawl with OntoMat and collect the necessary
documents. The crawl can be limited here to documents which are
annotated with the desired ontology. If necessary, the ontology
subset and the mapping to the Amilcare tags must be re-adjusted
according to the existing annotations in the crawled documents. Afterwards,
the desired type of document must still be checked manually.</p>
      <p>If there are no annotated documents, one can produce the
necessary corpus with OntoMat oneself. The user has to collect and
annotate documents of a certain type with the sub-set of the ontology
that was chosen in the project definition phase. The documents are
annotated by OntoMat with RDF facts. These facts are linked by an
XPointer description to the annotated text part. Because Amilcare
needs XML-tagged files as a corpus, these RDF annotations are
transformed into corresponding XML tags according to the mapping
done in the project definition. Only these tags are used to train. Other
tags, like HTML tags, are used as contextual information.</p>
      <p>The learning phase is executed by Amilcare, which is embedded
as a plugin into OntoMat. Amilcare processes each document of the
corpus and generates extraction rules as described in Section 3.
After the training, Amilcare stores the annotation rules in a certain file
which belongs to the project.</p>
      <p>Now it is possible to use the induced rules for semi-automatic
annotation. Based on the rules, the Amilcare plugin produces XML
annotation results (cf. A1 in Figure 2). Here a mapping (A2) is done
by OntoMat from the flat markup to the conceptual markup in
order to create new RDF facts (A3). This mapping is undertaken by
the discourse representation (cf. Section 5).</p>
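      <p>Taken together, the semi-automatic annotation step chains the three
processes. A condensed sketch, with invented function names and toy
data standing in for the real components:</p>

```python
# Condensed sketch of the semi-automatic annotation chain. Function names
# and data are invented for illustration: A1 information extraction yields
# flat tags, A2 builds the discourse representation, A3 the target facts.

def a1_extract(document):
    # stands in for Amilcare: flat XML-style tags extracted from text
    return [("hotel", "Zwei Linden"), ("city", "Dobbertin")]

def a2_discourse(tags):
    # baseline discourse representation: the hotel tag is the referent
    referent = next(v for t, v in tags if t == "hotel")
    return [(referent, "dr:" + t, v) for t, v in tags if t != "hotel"]

def a3_target(dr_facts):
    # map discourse-namespace relations to target ontology relations
    mapping = {"dr:city": "locatedAt"}
    return [(s, mapping[p], o) for s, p, o in dr_facts if p in mapping]

proposals = a3_target(a2_discourse(a1_extract("...")))
print(proposals)  # [('Zwei Linden', 'locatedAt', 'Dobbertin')]
```

      <p>The resulting facts are exactly the annotation proposals that the
user then confirms or refines in one of the interaction modes
described below.</p>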
      <p>This mapping results in several automatically generated proposals
for the RDF annotation of the document. The user can interact with
these annotation proposals at three different levels of automation: (i)
a highlighting of the annotation candidates, (ii) interactive
suggestion of each annotation, or (iii) a first fully automatic annotation of the
document and a later refinement by the user.
highlighting mode: First of all, the user opens a document he
would like to annotate in the OntoMat document editor. Then the
highlighting mode marks all annotation candidates by a colored
underline. The user can decide on his own whether he uses this hint for an
annotation or not.
interactive mode: This mode is also meant for individual
document processing. The interactive suggestion is a step by step
process. Every possible annotation candidate is suggested to the
user and he can refuse, accept or change the suggestion in a dialog
window.
automatic mode: The fully automatic approach is useful if there
is a bunch of documents that need to be annotated, so it can be done
in batch mode. All selected documents are annotated automatically.</p>
    </sec>
    <sec id="sec-9">
      <title>Related Work</title>
      <p>
        S-CREAM can be compared along four dimensions: First, it is a
framework for mark-up in the Semantic Web. Second, it may be
considered as a particular knowledge acquisition framework very
vaguely similar to Protégé-2000 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Third, it is certainly an
annotation framework, though with a different focus than ones like Annotea
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Fourth, it produces semantic mark-up with support of
information extraction.
We know of three major systems that intensively use knowledge
markup in the Semantic Web, viz. SHOE [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], Ontobroker [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
WebKB [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. All three of them rely on knowledge in HTML pages.
They all start with providing manual mark-up by editors. However,
our experiences (cf. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) have shown that text-editing knowledge
mark-up yields extremely poor results, viz. syntactic mistakes,
improper references, and all the problems sketched in the scenario
section.
      </p>
      <p>The approaches from this line of research that are closest to
S-CREAM are the SHOE Knowledge Annotator and the WebKB
annotation tool.</p>
      <p>
        The SHOE Knowledge Annotator is a Java program that allows
users to mark-up webpages with the SHOE ontology. The SHOE
system [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] defines additional tags that can be embedded in the body
of HTML pages. The SHOE Knowledge Annotator is more of a small
helper (like our earlier OntoPad [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) than a full-fledged
annotation environment.
      </p>
      <p>
        WebKB uses conceptual graphs for representing the semantic
content of Web documents. It embeds conceptual graph statements into
HTML pages. Essentially, it offers a Web-based, template-like
interface similar to the knowledge acquisition frameworks described next.
The S-CREAM framework allows for creating class and property
instances to populate HTML pages. Thus its target is roughly similar
to the instance acquisition phase in the Protégé-2000 framework [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
(the latter needs to be distinguished from the ontology editing
capabilities of Protégé). The obvious difference between S-CREAM
and Protégé is that the latter does not (and was not intended to)
support the particular Web setting, viz. managing and displaying Web
pages, not to mention Web page authoring. From Protégé we have
adopted the principle of a meta ontology that allows one to distinguish
between the different ways that classes and properties are treated.
      </p>
    </sec>
    <sec id="sec-10">
      <title>Comparison with Annotation Frameworks</title>
      <p>
        There are a number of  even commercial  annotation tools like
ThirdVoice9, Yawas [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], CritLink [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and Annotea (Amaya) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
These tools all share the idea of creating a kind of user comment
about Web pages. The term annotation in these frameworks is
understood as a remark to an existing document. For instance, a user of
these tools might attach a note like "A really nice hotel!" to the name
"Zwei Linden" on the Web page. In S-CREAM we would design a
corresponding ontology that would allow one to type the comment (an
unlinked fact) "A really nice hotel!" into an attribute instance
belonging to an instance of the class comment, with a unique XPointer at
"Zwei Linden".
      </p>
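A minimal sketch of the relational metadata this would yield: the namespace, property names, and XPointer expression below are invented for illustration, and triples are plain (subject, predicate, object) tuples rather than any particular RDF API.

```python
# Illustrative relational metadata for the "Zwei Linden" example
# (all URIs and property names are hypothetical).
EX = "http://example.org/annotation#"      # hypothetical ontology namespace
PAGE = "http://example.org/hotels.html"    # hypothetical annotated page

comment = EX + "comment-1"
# A unique XPointer locating "Zwei Linden" on the page (illustrative syntax).
target = PAGE + "#xpointer(string-range(//body,'Zwei Linden'))"

metadata = {
    (comment, "rdf:type", EX + "Comment"),           # instance of class comment
    (comment, EX + "text", "A really nice hotel!"),  # attribute instance
    (comment, EX + "annotates", target),             # link into the document
}

for triple in sorted(metadata):
    print(triple)
```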
      <p>
Annotea actually goes one step further. It allows the annotator to rely on an RDF
schema as a kind of template that is filled in by the annotator. For
instance, Annotea users may use a schema for Dublin Core and fill the
author slot of a particular document with a name. This annotation,
however, is again restricted to attribute instances. The user may also
decide to use complex RDF descriptions instead of simple strings
for filling such a template. However, Amaya provides no further help
in producing syntactically correct statements with proper references.
The only other system we know of that produces semantic markup
with support from information extraction is the annotation tool cited
in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. It uses information extraction components (Marmot,
Badger and Crystal) from the University of Massachusetts at Amherst
(UMass). It allows for the semi-automatic population of an ontology
with metadata. We assume that this approach is more laborious than
using Amilcare for information extraction; e.g., they had to define
their own verbs, nouns, and abbreviations in order to apply Marmot
to a domain. Also, they have not dealt with relational metadata or
authoring concerns so far.
      </p>
    </sec>
    <sec id="sec-11">
      <title>Conclusion</title>
      <p>CREAM is a comprehensive framework for creating annotations,
and relational metadata in particular, the foundation of the future
Semantic Web. The new version of S-CREAM presented here supports
metadata creation with the help of information extraction, in
addition to the other features of CREAM: inference services, a
crawler, a document management system, ontology guidance and a fact
browser, document editors/viewers, and a meta
ontology.</p>
      <p>OntoMat is the reference implementation of the S-CREAM
framework. It is Java-based and provides a plugin interface for extensions
for further advancements, e.g. collaborative metadata creation or
integrated ontology editing and evolution. The plugin interface has
already been used by third parties, e.g. for creating annotations for
Microsoft Word™ documents. Along similar lines, we are now
investigating how different tools may be brought together, e.g. to allow
for the creation of relational metadata in PDF, SVG, or SMIL with
OntoMat.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Ciravegna</surname>
          </string-name>
          , `
          <article-title>Adaptive information extraction from text by rule induction and generalisation'</article-title>
          ,
          <source>in Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI)</source>
          , Seattle, USA, (
          <year>August 2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Ciravegna</surname>
          </string-name>
          , `
          <article-title>Challenges in information extraction from text for knowledge management'</article-title>
          ,
          <source>IEEE Intelligent Systems and Their Applications</source>
          ,
          <volume>16</volume>
          (
          <issue>6</issue>
          ),
          <fpage>88</fpage>
          -
          <lpage>90</lpage>
          , (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Ciravegna</surname>
          </string-name>
          , `
          <article-title>(LP)², an adaptive algorithm for information extraction from web-related texts'</article-title>
          ,
          <source>in Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining held in conjunction with the 17th International Joint Conference on Artificial Intelligence (IJCAI)</source>
          , Seattle, USA, (
          <year>August 2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Ciravegna</surname>
          </string-name>
          and Daniela Petrelli, `
          <article-title>User involvement in adaptive information extraction: Position paper'</article-title>
          ,
          <source>in Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining held in conjunction with the 17th International Joint Conference on Artificial Intelligence (IJCAI)</source>
          , Seattle, USA, (
          <year>August 2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Erdmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fensel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Studer</surname>
          </string-name>
          , `
          <article-title>Ontobroker: Ontology Based Access to Distributed and Semi-Structured Information'</article-title>
          , in Database Semantics: Semantic Issues in Multimedia Systems, eds., R. Meersman et al.,
          <fpage>351</fpage>
          -
          <lpage>369</lpage>
          , Kluwer Academic Publisher, (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Denoue</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Vignollet</surname>
          </string-name>
          , `
          <article-title>An annotation tool for web browsers and its applications to information retrieval'</article-title>
          ,
          <source>in Proceedings of RIAO 2000</source>
          , Paris, (
          <year>April 2000</year>
          ). http://www.univsavoie.fr/labos/syscom/Laurent.Denoue/riao2000.doc.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Aaron</given-names>
            <surname>Douthat</surname>
          </string-name>
          , `
          <article-title>The message understanding conference scoring software user's manual'</article-title>
          ,
          <source>in 7th Message Understanding Conference Proceedings, MUC-7</source>
          , (
          <year>1998</year>
          ). http://www.itl.nist.gov/iaui/894.02/related projects/muc/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Erdmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maedche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-P.</given-names>
            <surname>Schnurr</surname>
          </string-name>
          , and Steffen Staab, `
          <article-title>From manual to semi-automatic semantic annotation: About ontology-based text annotation tools'</article-title>
          , in P. Buitelaar &amp; K. Hasida (eds.),
          <source>Proceedings of the COLING 2000 Workshop on Semantic Annotation and Intelligent Content</source>
          , Luxembourg, (
          <year>August 2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Eriksson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergerson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shahar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Musen</surname>
          </string-name>
          , `Automatic generation of ontology editors',
          <source>in Proceedings of the 12th Banff Knowledge Acquisition Workshop</source>
          , Banff, Alberta, Canada, (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Fensel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Angele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Erdmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-P.</given-names>
            <surname>Schnurr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Studer</surname>
          </string-name>
          , and Andreas Witt, `On2broker:
          <article-title>Semantic-based access to information sources at the www'</article-title>
          ,
          <source>in Proceedings of the World Conference on the WWW and Internet (WebNet 99)</source>
          , Honolulu, Hawaii, USA, (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <article-title>Reference description of the DAML+OIL (March 2001) ontology markup language</article-title>
          ,
          <year>March 2001</year>
          . http://www.daml.org/2001/03/reference.html.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Grosz</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Sidner</surname>
          </string-name>
          , `
          <article-title>Attention, intentions, and the structure of discourse'</article-title>
          ,
          <source>Computational Linguistics</source>
          ,
          <volume>12</volume>
          (
          <issue>3</issue>
          ),
          <fpage>175</fpage>-<lpage>204</lpage>
          , (
          <year>1986</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Gruber</surname>
          </string-name>
          , `
          <article-title>A Translation Approach to Portable Ontology Specifications'</article-title>
          ,
          <source>Knowledge Acquisition</source>
          ,
          <volume>6</volume>
          (
          <issue>2</issue>
          ),
          <fpage>199</fpage>
          -
          <lpage>221</lpage>
          , (
          <year>1993</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Handschuh</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          , `
          <article-title>Authoring and annotation of web pages in cream'</article-title>
          ,
          <source>in Proc. of WWW-2002</source>
          , (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Handschuh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Maedche</surname>
          </string-name>
          , `CREAM -
          <article-title>Creating relational metadata with a component-based, ontology driven framework'</article-title>
          ,
          <source>in Proceedings of K-CAP</source>
          <year>2001</year>
          , Victoria, BC, Canada, (
          <year>October 2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Siegfried</given-names>
            <surname>Handschuh</surname>
          </string-name>
          and Steffen Staab, `
          <article-title>Authoring and annotation of web pages in CREAM'</article-title>
          , in Proceedings of the WWW2002 - Eleventh International World Wide Web Conference (to appear), Hawaii, USA, (May
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Heflin</surname>
          </string-name>
          and J. Hendler, `
          <article-title>Searching the web with SHOE'</article-title>
          , in Artificial Intelligence for Web Search.
          <source>Papers from the AAAI Workshop. WS-00- 01</source>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>40</lpage>
          . AAAI Press, (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koivunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Prud'Hommeaux</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Swick</surname>
          </string-name>
          , `
          <article-title>Annotea: An Open RDF Infrastructure for Shared Web Annotations'</article-title>
          ,
          <source>in Proc. of the WWW10 International Conference. Hong Kong</source>
          , (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Nicholas</given-names>
            <surname>Kushmerick</surname>
          </string-name>
          , `
          <article-title>Wrapper induction for information extraction'</article-title>
          ,
          <source>in Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI)</source>
          , (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Luke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Spector</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rager</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hendler</surname>
          </string-name>
          , `
          <article-title>Ontology-based Web Agents'</article-title>
          ,
          <source>in Proceedings of First International Conference on Autonomous Agents</source>
          , (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Eklund</surname>
          </string-name>
          , `
          <article-title>Embedding Knowledge in Web Documents'</article-title>
          ,
          <source>in Proceedings of the 8th Int. World Wide Web Conf. (WWW`8)</source>
          , Toronto, May
          <year>1999</year>
          , pp.
          <fpage>1403</fpage>
          -
          <lpage>1419</lpage>
          .
          Elsevier Science B.V.
          , (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Diana</given-names>
            <surname>Maynard</surname>
          </string-name>
          , Valentin Tablan, Hamish Cunningham, Cristian Ursu, Horacio Saggion, Kalina Bontcheva, and Yorick Wilks, `
          <article-title>Architectural elements of language engineering robustness'</article-title>
          ,
          <source>Journal of Natural Language Engineering, Special Issue on Robust Methods in Analysis of Natural Language Data</source>
          , (
          <year>2002</year>
          ). forthcoming.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R.S.</given-names>
            <surname>Michalski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mozetic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Lavrac</surname>
          </string-name>
          , `
          <article-title>The multi-purpose incremental learning system AQ15 and its testing application to three medical domains'</article-title>
          ,
          <source>in Proceedings of the 5th National Conference on Artificial Intelligence</source>
          , Philadelphia, USA, (
          <year>1986</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Strube</surname>
          </string-name>
          and U. Hahn, `
          <article-title>Functional centering - grounding referential coherence in information structure'</article-title>
          ,
          <source>Computational Linguistics</source>
          ,
          <volume>25</volume>
          (
          <issue>3</issue>
          ),
          <fpage>309</fpage>
          -
          <lpage>344</lpage>
          , (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vargas-Vera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Domingue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Buckingham</given-names>
            <surname>Shum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanzoni</surname>
          </string-name>
          , `
          <article-title>Knowledge Extraction by using an Ontology-based Annotation Tool'</article-title>
          ,
          <source>in K-CAP 2001 workshop on Knowledge Markup and Semantic Annotation</source>
          , Victoria, BC, Canada, (
          <year>October 2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Ka-Ping</given-names>
            <surname>Yee</surname>
          </string-name>
          .
          <article-title>CritLink: Better Hyperlinks for the WWW</article-title>
          ,
          <year>1998</year>
          . http://crit.org/ping/ht98.html.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>