<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Docuphet - a dialogue-assisted content annotation tool</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mihály Héder</string-name>
          <email>mihaly.heder@computer.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domonkos Tikk</string-name>
          <email>tikk@informatik.hu-berlin.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Budapest University of Technology and Economics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Computer Science, Humboldt University in Berlin</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this decade, the amount of textual content stored on the web has become enormous, but the basic structure of web documents has remained unchanged: a mixture of text and markup. When creating a document, the user rarely has the possibility of embedding semantic annotations into the content, because editor applications either lack such a feature or are too difficult to use. We think that the creation of semantically rich documents is best facilitated by a content editor with text-mining technology running in the background. In this paper, some of these technologies are brought into the spotlight.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>On leave from the Dept. of Telecommunications and Media
Informatics, Budapest University of Technology and Economics.</p>
      <p>[…]cial markup language. This requires certain skills from the user and
has a negative impact on the size of the potential audience.
Another approach is to annotate the text off-line, after the content
has been created, without human supervision. This requires
natural language processing (NLP) technology, namely information
extraction (IE) tools. Considering the huge amount of
textual data already present on the Web, this line of research is very
important.</p>
      <p>Undoubtedly, it would be fruitful to provide tools that ease the
creation of semantic annotations for the average user when editing web
content. The Docuphet [9] project aims at discovering and
experimenting with possible solutions to this problem. In our view,
web content creation should be a continuous, mutual dialogue
between human and machine rather than a simple one-time input
process.</p>
      <p>This paper is organized as follows. In the following section we
describe the motivations and goals behind Docuphet and the main
components of the system. In Section 3 the technologies of the
implemented components are detailed. Section 4 presents two
example applications of the system. In Section 5 we discuss our
experiences, and propose some requirements on dialogue-assisted
semantic annotators. Section 6 reviews the related work in the field
of semantic annotation software. Section 7 concludes the paper and
discusses the future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. THE DOCUPHET PROJECT</title>
      <p>The main goal of the Docuphet project is to create a system that
enables the user to produce semantic annotation easily. This is
achieved via an intuitive user interface that is supported by text
processing algorithms in the background. We intend to capture the
meaning of the currently edited content by addressing simple
questions to the user. While the user types, the available text is
processed by IE applets in the background. The extracted facts are
formulated as statements that are conveyed to the user in the form
of closed-ended questions. If the user confirms a statement, the
system automatically embeds the appropriate semantic annotation
into the text. The annotator software does not require any special
technical knowledge or skill from the user; it is only assumed that
she knows the particular content that she is editing, and hence is
able to decide whether a related closed-ended question is true.
Another property of the system is that it stores text and
annotation together. This integration has many advantages. For instance,
when the text changes, no extra look-up operations are required to
transfer the changes into the corresponding semantic annotations.
If the annotations were stored separately, extra identifiers would
have to be maintained to link text and annotation.
To make the system accessible and runnable without installation, its
client part runs in an ordinary web browser.</p>
      <p>It is a natural requirement for every system to support
multiple languages, and it was considered when developing
every component of the system. However, many IE techniques are
language specific.1 For the first applications, Hungarian was
selected as the primary language.</p>
      <p>Docuphet focuses only on the creation of annotations. Further
uses of the semantically annotated content, such as semantic search
and retrieval or machine reasoning, are currently out of the scope of
the project. We concentrated our efforts on the design and
validation of our concept; therefore, some components of the prototypical
applications are not yet optimized for efficiency under heavy
load.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 The main components of Docuphet</title>
      <p>The user interacts with the Docuphet Content Editor’s (DCE) web
interface. The web application forwards the created content to the
server via AJAX calls. On the server the content is distributed to
various IE modules. The modules generate annotation suggestions.
Each suggestion comprises a textual statement or question, a
numeric confidence level, one or more possible answers to the
question, and semantic annotations (one per positive answer). The
confidence level is a real number in the unit interval that
specifies the validity of the suggestion. Suggestions below a certain
(configurable) level are automatically disregarded. The modules’
suggestions are collected on the server and sent back to the client,
where they are presented to the user in the form of pop-up
closed-ended questions. If the user confirms a statement, the system inserts
the corresponding annotation into the content. When the user
finishes editing, the full annotated document is sent back to the server
and saved there. Figure 1 provides an overview of the system.
In the next section we discuss the components of Docuphet: the
content editor, the annotation storage, and the text processing and
information extraction applets.</p>
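      <p>The suggestion workflow described above can be sketched as follows. This is an illustrative Python model, not the actual Docuphet API: the field names and the RDFa snippet are our own. It shows the shape of a suggestion and how suggestions below the configurable confidence level are disregarded on the server.</p>

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a Docuphet annotation suggestion: a statement
# shown to the user, a confidence level in [0, 1], the possible answers,
# and one annotation snippet per positive answer (names are illustrative).
@dataclass
class Suggestion:
    statement: str
    confidence: float
    answers: list
    annotations: dict = field(default_factory=dict)  # answer -> annotation

def filter_suggestions(suggestions, threshold=0.5):
    """Drop suggestions below the configurable confidence level."""
    return [s for s in suggestions if s.confidence >= threshold]

suggestions = [
    Suggestion("This article is related to Budapest.", 0.92, ["yes", "no"],
               {"yes": '<para property="bio:relatedTo">Budapest</para>'}),
    Suggestion("The author was born in 1912.", 0.30, ["yes", "no"]),
]
# Only the high-confidence suggestion survives the cut.
kept = filter_suggestions(suggestions, threshold=0.5)
```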
    </sec>
    <sec id="sec-4">
      <title>3. TECHNOLOGY OVERVIEW</title>
    </sec>
    <sec id="sec-5">
      <title>3.1 The content editor</title>
      <p>There is an abundance of tools for creating content on a computer.
These can be categorized in many ways. One aspect is the mode of
editing. There are WYSIWYG editors, like desktop word
processors. Other, structured editors have two views: one for editing and
one for viewing the document. In structured editors, the user
handles objects such as sections, titles, and paragraphs. This category includes
wiki editors, publishing tools, DocBook editors, and scientific
editors for LaTeX. In these applications the final formatting is done
with style class files or style sheets.</p>
      <p>
        Another aspect is the technology of the editor. There are two main
categories: the more function-rich desktop applications that need
to be installed on the client, and the web-based editors [
        <xref ref-type="bibr" rid="ref30">11, 34</xref>
        ] that
require only a web browser to run. Web-based editors were
formerly simpler, but as AJAX became widespread, the complexity
of such applications became almost equal to that of desktop ones.
1 In fact, it is one of the central problems of text processing to
provide language-independent information extraction methods.
      </p>
      <p>The special requirements for DCE led to the development of a
completely new solution. DCE is an easy-to-use WYSIWYN2
editor, which is capable of editing the structure of a document
without the need to learn a markup language. DCE also handles the
communication with the server, the presentation of suggestions,
and the integration of annotations (see also Figures 4–5 for
screenshots).</p>
    </sec>
    <sec id="sec-6">
      <title>3.2 Content storage</title>
      <p>For selecting the best of the plethora of content-storing formats, we
set up the following requirements:
1. simplicity, to allow implementation as a web editor;
2. standard, stable, and free to use;
3. proper support (documentation, examples, templates, editors,
tools);
4. extensibility to carry annotations;
5. support for the following formatting: paragraphs, sections
and titles; lists, program listings, emphasis, images, tables,
links.</p>
      <p>
        Many options were evaluated: RTF , texinfo [
        <xref ref-type="bibr" rid="ref9">13</xref>
        ], troff [7], wikitext
[42], XHTML, DocBook, DITA [
        <xref ref-type="bibr" rid="ref21">25</xref>
        ], LATEX , ODF, and CDF [
        <xref ref-type="bibr" rid="ref34">38</xref>
        ].
Because of the large variety of convenient XML processing tools,
we dropped the non-XML formats. We also dropped ODF because
of its high complexity. DITA and CDF (WICD) are too specific
for our purposes. From the remaining two candidates we decided
in favour of DocBook, because this format is purely structural: it
does not contain any markup related to document formatting,
and its grammar is defined in an easy-to-subtype Relax NG [
        <xref ref-type="bibr" rid="ref22">26</xref>
        ]
format.
2 "What You See is What You Need" editors let the user edit the
structure of the document but not the source markup directly. They
differ from WYSIWYG editors in that further transformations are
applied to generate the final view of the document.
      </p>
    </sec>
    <sec id="sec-7">
      <title>3.3 Storing semantic annotations in the content</title>
      <p>
        Many possible technologies were evaluated for storing semantic
annotations in a DocBook document: HTML Metadata, RDF XML
[
        <xref ref-type="bibr" rid="ref35">39</xref>
        ], GRDDL [
        <xref ref-type="bibr" rid="ref10">14</xref>
        ], Microformats and RDFa [
        <xref ref-type="bibr" rid="ref36">40</xref>
        ]. Finally we have
selected RDFa because of its many advantages.
      </p>
      <p>
        RDFa [
        <xref ref-type="bibr" rid="ref36">40</xref>
        ] has been developed by the W3C and is by now a
W3C recommendation. RDFa offers a technique that transforms an
arbitrary part of an XML document into an RDF triple. The
technology is primarily aimed at annotating XHTML documents but
is also capable of handling XML documents from other namespaces
(an example is depicted in Figure 2):
&lt;article
  xmlns="http://docbook.org/ns/docbook"
  xmlns:dc="http://purl.org/dc/elements/1.1/"&gt;
  &lt;title property="dc:title"&gt;The trouble with Bob&lt;/title&gt;
  &lt;para id="ch1" property="dc:creator"&gt;Alice&lt;/para&gt;
  &lt;para about="#ch1" property="dc:creator"&gt;Tom&lt;/para&gt;
  ...
&lt;/article&gt;
      </p>
      <p>RDFa was also favoured because it is easy to integrate with other
XML namespaces, like DocBook. RDFa allows annotating every
part of a document, while it is still relatively easy to retrieve the
RDF triples from the XML. To make DocBook conform with RDFa, we
extended the DocBook RNG schema to carry RDFa annotations,
and we termed this DocBook profile DocBook/RDFa.</p>
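      <p>For illustration, the triples can be recovered from such a fragment with a few lines of code. The following sketch is a deliberate simplification of real RDFa processing: it handles only the property, about, and id attributes, and the subject-resolution rules of the full RDFa specification are considerably richer.</p>

```python
import xml.etree.ElementTree as ET

# Simplified triple extraction from an RDFa-annotated DocBook fragment,
# modelled on the example above (the default namespace is omitted here
# to keep the sketch short).
DOC = """\
<article xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title property="dc:title">The trouble with Bob</title>
  <para id="ch1" property="dc:creator">Alice</para>
  <para about="#ch1" property="dc:creator">Tom</para>
</article>"""

def extract_triples(xml_text, base="#doc"):
    triples = []
    for el in ET.fromstring(xml_text).iter():
        prop = el.get("property")
        if prop is None:
            continue
        # An 'about' attribute names the subject; an 'id' introduces one;
        # otherwise fall back to the document base.
        subject = el.get("about") or ("#" + el.get("id") if el.get("id") else base)
        triples.append((subject, prop, (el.text or "").strip()))
    return triples

triples = extract_triples(DOC)
```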
    </sec>
    <sec id="sec-8">
      <title>3.4 Extracting semantic information from the text</title>
      <p>3.4.1 Named Entity Recognition
A named entity (NE) is a natural textual identifier of an object,
such as a person name, a company name, a location, a product name,
an address, a telephone number, or an email address. Named
entity recognition (NER) is an NLP task that aims at identifying
and classifying NEs in the text.</p>
      <p>The difficulty of the recognition task depends on the type of NE.
Telephone numbers and e-mail addresses can be easily recognized with
simple regular expressions. The recognition of personal names and
locations is more difficult, but it can be effectively supported with
appropriate vocabularies. The recognition of company or
product names can be much harder, because essentially no constraint
applies to their surface form. NER is primarily performed by
analysing features of the candidate NEs, e.g. surface clues
(capitalization, numbers, special symbols), frequency, and
in-sentence, in-paragraph, and in-document positions, whereas
grammatical and morphological analysis may also be applied.
The Docuphet framework contains a general-purpose named
entity recognizer, JNER, implemented in Java. This component
comprises several modules, each analysing the text with a
different technique. Many of them use vocabularies, e.g. for given
names, company suffixes, or locations. Others are based on regular
expressions. It is possible to plug in external tools, such as a
stemmer or a morphological parser. Other external tools can serve as
connectors to databases (e.g. IMDB, DMOZ) or to search engines
(Google, Wikia Search).</p>
      <p>Although professions (like painter, composer, engineer) and human
properties (blond, tall) are usually not considered NEs, their
recognition is also supported in JNER.</p>
      <p>In line with the philosophy of Docuphet (ask the user relevant
questions), NER in this system can also be assisted by
asking the user closed-ended questions.</p>
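      <p>A minimal sketch of such regular-expression and vocabulary modules is given below. The patterns and the tiny given-name list are illustrative, not JNER's actual rules; a real module set would be far larger and language-specific.</p>

```python
import re

# Two regex modules (email, phone) plus a vocabulary-backed person-name
# rule, in the spirit of the modules described above.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\+?\d[\d\s/-]{6,}\d"),
}
GIVEN_NAMES = {"Pat", "Alice", "Tom"}          # illustrative vocabulary
NAME = re.compile(r"\b([A-Z][a-z]+)\s+([A-Z][a-z]+)\b")

def recognize(text):
    entities = []
    for label, pat in PATTERNS.items():
        entities += [(label, m.group()) for m in pat.finditer(text)]
    for m in NAME.finditer(text):
        if m.group(1) in GIVEN_NAMES:          # given-name vocabulary rule
            entities.append(("person", m.group()))
    return entities

ents = recognize("You can reach Pat Nixon at pat@example.org or +1 555 123 4567.")
```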
      <sec id="sec-8-1">
        <title>3.4.2 Information Frame Recognition</title>
        <p>Let us define an information frame (IF) as an RDF triple (we use
the Notation 3 syntax of RDF in this article for brevity)
&lt;subject&gt;
&lt;predicate&gt;
&lt;object&gt; .
in which at least one value is missing and thus substituted by a
variable name. The class of the missing component(s) may be known.
An example IF:
&lt;X (an instance of the person class)&gt;
&lt;location of birth&gt;
&lt;Y (an instance of the location class)&gt; .
Information frame recognition (IFR) means the recognition of
instances of an IF in the text by identifying the missing components
of the triple. For instance, in the sentence “Pat Nixon was born in
Nevada in 1912”, we can recognize an instance of the above IF:
&lt;Pat Nixon&gt;
&lt;location of birth&gt;
&lt;Nevada&gt; .</p>
        <p>
          IFs can be defined in various ways. One method is to derive them
from the “semantic frames” used in the Berkeley FrameNet
project [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] (see an example below). The aim of this project is to
create an annotated lexical resource for English, using frame
semantics and supported by corpus evidence. These frames refer
to pre-defined conceptual structures.
frame(DESIGN),
inherit(CREATE),
frame_elements(DESIGNER(=CREATOR),
BUILDING(=WORK)),
scenes(DESIGNER designs BUILDING)
        </p>
        <p>The frame elements refer to certain semantic roles of the actors
and objects present in the scene. In the FrameNet project, human
annotators mark the occurrences of the frame elements in texts.
Evidently, to find the possible semantic role of a text element, it is
very helpful to have the named entities and their types identified
beforehand. In some cases additional rules may also be useful,
e.g. constraints on the relative position of the elements
of an IF (in-sentence, in-paragraph). While limiting the number
of potentially identifiable IFs, such a constraint has many practical
advantages: it enables the IFR process to start before the whole
content is available, and it narrows the search space significantly,
thus reducing the computation time.</p>
        <p>JFrame is the IFR component of the Docuphet framework, written
in Java. IFs and the corresponding recognition rules
can be defined in JFrame. The input of the module is a token stream
in which the recognized NEs and their types are already marked.
A JFrame IF definition may contain rules related to the class and
lemmas of NEs in the token stream, and may have conditions on their
order. JFrame also provides a confidence level for every recognized
IF instance, in accordance with the concept of Docuphet’s workflow.
Currently in Docuphet, the IFR rules are defined by hand after
experimentation. This is necessary because:</p>
        <p>1. There is not enough properly annotated text that could be used
as training data.</p>
        <p>2. The annotation method in Docuphet is based on dialogues.
Therefore, for each IF a simple function must be defined
which generates the corresponding closed-ended questions.</p>
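        <p>A JFrame-style rule might be sketched as follows. This is a hypothetical simplification: real JFrame rules also handle lemmas and richer ordering conditions, and the frame, tokens, and confidence value here are illustrative.</p>

```python
# Match a typed token stream against a simple information-frame pattern:
# a subject NE, a trigger word, then an object NE, producing an RDF-like
# triple with a confidence level (all names and values are illustrative).
def recognize_if(tokens, frame):
    """tokens: list of (surface, ne_type) pairs; frame: dict with
    'subject_type', 'trigger', 'object_type', 'predicate', 'confidence'."""
    subj = obj = None
    triggered = False
    for surface, ne_type in tokens:
        if ne_type == frame["subject_type"] and subj is None:
            subj = surface
        elif surface == frame["trigger"]:
            triggered = True
        elif ne_type == frame["object_type"] and triggered:
            obj = surface
    if subj and obj:
        return (subj, frame["predicate"], obj), frame["confidence"]
    return None

BIRTHPLACE = {"subject_type": "person", "trigger": "born",
              "object_type": "location", "predicate": "location of birth",
              "confidence": 0.8}

sentence = [("Pat Nixon", "person"), ("was", None), ("born", None),
            ("in", None), ("Nevada", "location"), ("in", None),
            ("1912", "date")]
result = recognize_if(sentence, BIRTHPLACE)
```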
      </sec>
      <sec id="sec-8-2">
        <title>3.4.3 Sentence segmentation</title>
        <p>
          To support the recognition of IFs, a sentence segmenter was
developed, named JSentence. The component implements a
modified version of the algorithm described in [
          <xref ref-type="bibr" rid="ref29">33</xref>
          ]. The Hungarian
configuration of the tool was tested on the Szeged2 corpus [
          <xref ref-type="bibr" rid="ref27">31</xref>
          ],
the biggest multi-thematic text corpus in Hungarian. The corpus
contains 82,096 sentences and consists of complete novels from
various authors and genres, high school essays, general
newspaper articles, legal texts, computer-related handbooks, and economic
news. JSentence recognized sentence boundaries with a precision
of 99.06% at the FP = FN operating point. JSentence uses a rule-based
algorithm with 15 regular-expression-based rules and 4 abbreviation lists.
        </p>
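        <p>The essence of such a rule-based segmenter can be sketched in a few lines. The single boundary rule and the tiny abbreviation list below are illustrative only; JSentence itself uses 15 rules and 4 abbreviation lists.</p>

```python
import re

# Minimal rule-based sentence segmentation: split at sentence-final
# punctuation followed by a capitalized word, unless the preceding
# token is a known abbreviation (illustrative list).
ABBREVIATIONS = {"dr.", "pl.", "etc.", "cf."}

def segment(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        preceding = text[start:m.end()].rstrip().split()[-1].lower()
        if preceding in ABBREVIATIONS:
            continue  # the abbreviation rule blocks this boundary
        sentences.append(text[start:m.start() + 1])
        start = m.end()
    sentences.append(text[start:])
    return sentences

parts = segment("Dr. Kovács was born in 1912. He lived in Budapest.")
```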
      </sec>
    </sec>
    <sec id="sec-9">
      <title>4. APPLICATION EXAMPLES</title>
      <p>
        In this section two demo applications of Docuphet are presented.
BioBase is a web site that collects biographies of known people
(similar to the ones in [
        <xref ref-type="bibr" rid="ref14">18</xref>
        ]), user autobiographies, and simple
self-introductory texts. FlatBase is a real estate advertisement portal. In the
case of the biographies, Docuphet is configured to recognize IFs
based only on the entered text. For the FlatBase portal, the
entered text is analysed first, and when some relevant information is
still missing (e.g. the floor number), Docuphet automatically asks
the user questions to complete the advertisement database
properly. Both applications are configured to work on Hungarian text.
The workflow is the same in both scenarios, as described in
Section 2.1. Both applications use DCE and the same document server
component versions. They differ in how the annotations are
produced, so we discuss this part in detail next.
      </p>
    </sec>
    <sec id="sec-10">
      <title>4.1 BioBase</title>
      <p>In BioBase a two-layered IF recognition has been implemented. In
the first layer NEs are identified as follows:
1. JNER analyses the text, configured with all the NER rules
available for the Hungarian language. Currently this includes
person name recognition (male and female distinguished)
based on regular expressions and a given-name vocabulary;
location, nationality, profession, education, residence, family
status, and social relation (friends/other related people)
recognition based on vocabularies; email and phone number
recognition based on regular expressions; and date recognition based
on a custom Java component, regular expressions, and a list
of month names.
2. JNER assigns a confidence level to every recognized NE,
which is calculated by summing the pre-defined values of the
matching rules.3 Over confidence level 0.95, an annotation
suggestion is created with the following RDF triple:
&lt;article id&gt;
&lt;related to [NE type]&gt;
&lt;[NE value]&gt;.
where NE type and NE value are substituted with the actual
values. The annotation suggestions also have a level of
confidence4 that is set to 0.95. DCE accepts every annotation
above 0.9 without asking a confirming question of the
user, in order to avoid flooding the user with questions.5
The target of the annotations (the location where the
annotation is placed) is the node before the given paragraph. From
these annotations, a typed tag cloud can be generated
(see also Figure 4).
3. For the top three person and location NEs with a confidence
level between 0.8 and 0.95, an annotation suggestion of value
0.8 is created with the question “This article is related to the
person/location (value)”, with the same target and RDF triple
as in the previous step.
3 The creator of a rule can set both positive and negative confidence
coefficients.
4 This is based on the NE confidence level, but can be weighted, e.g.
when a candidate person NE shares its surname with an already
recognized NE.
5 Every annotation can be easily removed by deletion or with the
undo action.</p>
      <p>Based on the recognized NEs, the following IE attempts are made:
1. First, the person whom the article is about is identified. This is
done by creating suggestions for the first person NE found
(in the title or in the text) with the triple
&lt;article id&gt;
&lt;describes the life of&gt;
&lt;[NE value]&gt;.
where the target is the beginning of the article and the confidence
level is 0.8. This is repeated with the subsequent person NEs
until a positive answer is given by the user. (In our
experience the article is nearly always about the first mentioned
person.)
2. If the subject of the article is known, the date and place of
birth and death are attempted to be identified next. This is
done by analysing the date and location tokens within a
centered window of 15 tokens around the occurrences of the
identified person. Extra confidence is added if a verb in
past tense form, “született” (was born) or “elhunyt | meghalt”
(died), appears around a particular date or location (these are often
omitted, however).
3. Nationality, profession, education, residence, friends, and
parents’ names are identified in a similar way as described in the
previous point, involving certain terms (his mother | father)
where appropriate.</p>
      <p>
        Using these annotations, persons can be categorized by the era
in which they lived or live, by profession, nationality, or
location. All the RDF properties used in the process are in BioBase’s
namespace, but if required, they can be easily mapped into other
namespaces, like FOAF [
        <xref ref-type="bibr" rid="ref8">12</xref>
        ].
      </p>
      <p>Our experience with BioBase shows that Docuphet is able to
capture the basic biographic information about a person. When
creating a 500-word biography (see Figure 4), the system pops up
10–15 mostly adequate questions. One direction of further work
could be the addition of new event IFs based on a review of
biographies.</p>
    </sec>
    <sec id="sec-11">
      <title>4.2 FlatBase</title>
      <p>IFR is performed in FlatBase in two layers as well, but in a completely
different way than in BioBase. On the first level, the main informa[…]
On both levels, specially configured JNER instances are used. The
configuration initially includes a short list of locations and
regular-expression-based rules to recognize price, size, and contact
information. IFR is then performed based on the already known NEs and
certain trigger words related to building materials, heating types,
etc. When some information is extracted, the JNERs may be
reconfigured accordingly (e.g. loading the corresponding quarters when
a district of Budapest is found).</p>
      <p>When some of the main information is missing from the ad at
saving time, the user is asked to complete it.6 If this happens again, a
limited number of direct questions are formulated about the
missing information, each question only once. At the end, the ad
is saved even if some information is still missing.</p>
      <p>FlatBase has proven to be very effective because flat advertisements
are usually short, very similar to each other, and their vocabulary
is limited. As a result, FlatBase is capable of extracting nearly all
(usually not more than 5–10, see Figure 5) important pieces of information
from an advertisement. This suggests that FlatBase may become a
user-friendly input-assisting tool for advertisement portals.</p>
    </sec>
    <sec id="sec-12">
      <title>5. DISCUSSION</title>
      <p>6 But the actual content is stored anyway.</p>
      <p>
        Bodain and Robert composed five requirements on the static
properties of semantic annotations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]:
robust anchors,
transparency,
freedom in choosing the semantic vocabulary,
variable granularity,
handling dynamic updates.
      </p>
      <p>
        For dialogue-assisted annotators, such as Docuphet, most of the
above requirements can be carried over7, but as an outcome of our
experimentation, we can now formulate additional requirements on
semantic annotation creation:
1. A particular question must never be asked twice, to avoid
user discontent. This implies that:
2. Every suggestion must be stored in the document, regardless
of the answer, if any. Per-session storage is not sufficient,
since the document can be edited in several sessions. Further
research is needed to find out whether suggestions should be
stored on a per-document or a per-user basis.
3. There are two main types of suggestions, depending on the
positive/negative answer of the user. As future work, it
should be investigated how negative answers can be exploited
for annotation.
4. According to the “handling dynamic updates” rule described
in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the annotations must be re-validated upon every text
change. Given our point 1, this requirement can hardly be
met. Theoretically, it is possible to insert a statement into
the first part of a document that negates the meaning of
everything in a given scope. To handle this appropriately, we
should either fully capture the meaning of the change and
update the right annotations, or re-ask every question. The
first solution is not yet feasible; the latter causes
a high number of undesirable questions. To get around this
problem, we devised two techniques:
      </p>
      <p>Limited scope: We have defined two types of
annotations: basic and derived. Basic annotations apply only
to a specific text scope (a title, a paragraph, a list item).
We assume that changes outside the scope do not
affect them. To fulfil this assumption, basic annotations
are very simple, e.g. “This paragraph is related to
Budapest” when the token Budapest is present. These
annotations are usually retrieved by NER as described in
Section 4.1, and re-validated upon text change in their
scope.</p>
      <p>Dependencies: We define a dependency graph of
annotations. Derived annotations depend on basic or other
derived annotations. The dependency is tracked with a
list of annotation IDs. Every time an annotation changes,
its dependants should be re-validated. If an annotation
is deleted, the derived annotations must be deleted as
well. Furthermore, a derived annotation may require
re-validation when non-annotated parts of the text
change, since some derived annotations may depend on
the characteristics of the text or on missing annotations.
For example, if the user accepts a new “main category”
annotation, the old one must be re-validated or simply
deleted.
5. The number of questions asked together has to be limited.</p>
      <p>This is important because if the user pastes in a larger piece
of text, then many suggestions may be generated; these must
be asked in several turns.
7 Although we applied a pre-defined semantic vocabulary because
of the nature of the system.</p>
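      <p>The dependency rule above can be sketched as follows. This is a hypothetical in-memory store; the real system would persist the annotation IDs in the document itself.</p>

```python
# Track which annotations each derived annotation depends on; deleting
# an annotation cascades to every annotation derived from it.
class AnnotationStore:
    def __init__(self):
        self.deps = {}          # annotation id -> list of prerequisite ids

    def add(self, ann_id, depends_on=()):
        self.deps[ann_id] = list(depends_on)

    def dependants(self, ann_id):
        """Annotations that list ann_id among their prerequisites."""
        return [a for a, pre in self.deps.items() if ann_id in pre]

    def delete(self, ann_id):
        """Deleting an annotation deletes every derived annotation too."""
        for d in self.dependants(ann_id):
            self.delete(d)
        self.deps.pop(ann_id, None)

store = AnnotationStore()
store.add("basic:budapest")
store.add("derived:hungary", depends_on=["basic:budapest"])
store.add("derived:europe", depends_on=["derived:hungary"])
store.delete("basic:budapest")   # cascades to both derived annotations
```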
    </sec>
    <sec id="sec-13">
      <title>6. RELATED WORK</title>
      <p>In the last 15 years, a plethora of semantic annotation tools has
been developed. Here we recall and compare the most important
ones (see also Table 1).</p>
      <p>
        We can divide the semantic annotators into two main groups:
Semantic Wikis: One form of inserting semantic
annotations into documents is via semantic wikis, such as Semantic
Mediawiki [
        <xref ref-type="bibr" rid="ref19">23</xref>
        ], Artificial Memory [
        <xref ref-type="bibr" rid="ref17">21</xref>
        ], Kaukolu [8],
PHPWiki [
        <xref ref-type="bibr" rid="ref32">36</xref>
        ], IkeWiki [
        <xref ref-type="bibr" rid="ref26">30</xref>
        ], and SWiM [
        <xref ref-type="bibr" rid="ref15">19</xref>
        ]. These applications
enable the user to input RDF data by using a special
syntax. The available semantic vocabulary and the granularity
of the annotations vary in these applications, but in all cases
semantic handling skill is required from the user.
      </p>
      <p>
        Desktop ontology builders and annotators: This group
contains some feature-rich desktop annotators for authoring
semantically annotated documents. Protégé [
         <xref ref-type="bibr" rid="ref24">28</xref>
         ], TopBraid
[
        <xref ref-type="bibr" rid="ref31">35</xref>
        ], Amaya [
        <xref ref-type="bibr" rid="ref25">29</xref>
        ], and Mangrove [
        <xref ref-type="bibr" rid="ref18">22</xref>
        ] are frameworks for
building ontologies and knowledge graphs. SWEDT [
        <xref ref-type="bibr" rid="ref23">27</xref>
        ],
Apolda [
        <xref ref-type="bibr" rid="ref37">41</xref>
        ], and KATIA [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] have rich document editing and
annotating capabilities. These are professional tools for
knowledge experts.
      </p>
      <p>
        S-CREAM [
        <xref ref-type="bibr" rid="ref11">15</xref>
        ] integrates the Amilcare [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] IE module that
implements a semi-supervised machine learning method: a
set of training data must be annotated by hand in advance,
then Amilcare creates certain annotations automatically in
new documents.
      </p>
      <p>
        COHSE [
        <xref ref-type="bibr" rid="ref2">2</xref>
         ] highlights text and provides additional
information for strings matching elements of a pre-defined
knowledge base. Magpie [
        <xref ref-type="bibr" rid="ref7">10</xref>
        ] allows annotation of a pre-defined
set of concepts based on forms.
      </p>
      <p>
        Docuphet has much in common with SWEDT [
         <xref ref-type="bibr" rid="ref23">27</xref>
         ] and KATIA [
         <xref ref-type="bibr" rid="ref3">3</xref>
         ] in the visualization of annotations. Unlike these tools, however,
Docuphet’s editor hides the annotation markup details from the user and
provides annotation visualization instead.
      </p>
      <p>
        Like Melita [
        <xref ref-type="bibr" rid="ref28">32</xref>
        ], AKTiveDoc [
        <xref ref-type="bibr" rid="ref16">20</xref>
        ], MnM [
        <xref ref-type="bibr" rid="ref33">37</xref>
        ], or S-CREAM [
        <xref ref-type="bibr" rid="ref11">15</xref>
        ]
Docuphet also uses IE technology to extract semantic annotation
candidates from the text. However, the way Docuphet uses IE is
quite dissimilar from these tools, since it uses IFs as a common
concept of semantic information and involves the user into the
process.
      </p>
      <p>Like COHSE and Magpie, Docuphet is based on pre-defined
concepts and relations, termed information frames. The
target audience of unskilled users and the NLP- and IE-based,
dialogue-assisted annotation creation set our solution apart.
Our approach is comparable to the question-sequence-based
guidance provided by some complex installation wizards, where later
questions depend on the information gathered earlier. In
Docuphet the set of IFR elements represents the knowledge base, which
provides a sophisticated, flexible, and order-independent solution.</p>
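      <p>The order-independent behavior described above can be illustrated
with a small sketch. This is not Docuphet’s actual implementation: the
frame fields, the example IF, and its slot names are all assumptions made
for illustration.</p>

```python
# Illustrative sketch only: the paper does not publish Docuphet's internal
# representation, so every field and slot name below is an assumption.
from dataclasses import dataclass

@dataclass
class InformationFrame:
    """A pre-defined concept with typed slots and a question template."""
    concept: str   # e.g. "Person"
    slots: dict    # slot name -> expected named-entity type
    question: str  # wording shown to the user

# A hypothetical IF for a biography domain.
birth_if = InformationFrame(
    concept="Person",
    slots={"birth_place": "LOCATION", "birth_date": "DATE"},
    question="Was {entity} born in {birth_place}?",
)

def pending_slots(frame, answers):
    """Order-independent dialogue: ask only about slots that are still
    unfilled, no matter in which order earlier answers arrived."""
    return [s for s in frame.slots if s not in answers]

print(pending_slots(birth_if, {"birth_date": "1802-03-15"}))
```

      <p>Because only the unfilled slots are returned, such a dialogue can
resume at any point, regardless of the order in which earlier answers
arrived.</p>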
    </sec>
    <sec id="sec-14">
      <title>7. CONCLUSION AND FUTURE WORK</title>
      <p>Docuphet is a dialogue-assisted semantic text annotator. The
computer–human dialogue is facilitated by IE techniques: named
entity recognition and information frame extraction. In Docuphet,
unlike some semantic wikis, it is not possible to annotate the text
arbitrarily. It collects all named entities as potential instances of
ontology classes and builds the corresponding ontology simultaneously.
Therefore Docuphet can only handle pre-defined RDF information
triples, which limits the flexibility of the system. On the other hand,
this very property makes it possible to compose easy-to-understand
questions about the known triples, as the questions are defined together
with the corresponding IFs. This way it is easy to create annotations
even for completely uninitiated users.</p>
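      <p>The pairing of a pre-defined triple with a ready-made question can
be sketched as follows. The catalog layout, the predicate URI, and the
question wording are illustrative assumptions, not Docuphet’s actual
data.</p>

```python
# Hedged sketch of one IF entry: an RDF triple pattern stored together
# with its question. The predicate URI and wording are assumptions.
IF_CATALOG = {
    "bornIn": {
        "triple": ("?person", "http://example.org/ont#bornIn", "?place"),
        "question": "Was {person} born in {place}?",
    },
}

def ask(if_name, bindings):
    """Instantiate the question and the candidate triple from bindings."""
    entry = IF_CATALOG[if_name]
    question = entry["question"].format(**bindings)
    s, p, o = entry["triple"]
    # Replace the ?variables in the pattern with the bound values.
    triple = (bindings.get(s.lstrip("?"), s), p, bindings.get(o.lstrip("?"), o))
    return question, triple

q, t = ask("bornIn", {"person": "Kossuth Lajos", "place": "Monok"})
print(q)  # Was Kossuth Lajos born in Monok?
```

      <p>A confirmed answer would commit the instantiated triple, while a
rejection discards it; either way the question shown to the user never
exposes the underlying RDF.</p>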
      <p>Given these properties, Docuphet is most useful when the domain
of the text is known in advance. Two exemplary applications were
presented in Section 4. Other possible applications include the
annotation of economic or sports news, product reviews, or geolocation
reviews. In these cases the set of appropriate IFs and corresponding
IFR rules have to be created in advance.</p>
      <p>
        Despite these limitations, basic functionality is available
without specific domain knowledge. Docuphet is capable of
recognizing NEs in an arbitrary text and formulating questions about
the NE candidates. This makes it a very useful tool for building
NE databases and for disambiguation applications. Another
possible application area is assisting context-sensitive browser
tools, such as In4’s iGlue [
        <xref ref-type="bibr" rid="ref13">17</xref>
        ] and Context Discovery
Inc.’s Context Organizer [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
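      <p>The domain-independent mode can be sketched as follows; the
regex-based recognizer is a crude stand-in for Docuphet’s real NER
component, and the question wording is an assumption.</p>

```python
import re

def ne_candidates(text):
    # Crude stand-in for NER: runs of two or more capitalized words.
    return [m.strip() for m in re.findall(r"\b(?:[A-Z][a-z]+ ?){2,}", text)]

def questions(text):
    # One confirmation question per candidate, in the spirit of the
    # dialogue-assisted annotation described above.
    return [f"Is '{c}' the name of a person, place, or organization?"
            for c in ne_candidates(text)]

for q in questions("Albert Szent Gyorgyi worked in Szeged."):
    print(q)
```

      <p>Each confirmed candidate could then be stored as a new entry in an
NE database, or offered to a disambiguation component.</p>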
      <p>As for future work, we intend to enable Docuphet to access
and edit Wikipedia articles via the interface provided by
MediaWiki’s public API. Since Wikipedia uses the wikitext format,
which is very different from DocBook/RDFa, the most problematic
tasks are the conversion of the articles and the placement of the
annotations in wikitext. We also plan to integrate Docuphet with
large public databases like IMDb, to facilitate disambiguation and
named entity recognition.</p>
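      <p>As a sketch of the planned integration, a request for an article’s
wikitext can be assembled against MediaWiki’s public api.php endpoint
(the parameters follow the MediaWiki Action API; the wikitext-to-DocBook/RDFa
conversion itself remains the open problem and is not sketched here):</p>

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"  # MediaWiki's public endpoint

def wikitext_request(title):
    """Build the URL that asks the Action API for an article's current
    wikitext (no network call is made here)."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return API + "?" + urlencode(params)

print(wikitext_request("Semantic Web"))
```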
      <p>We think that bidirectional communication, that is, questions
and answers, is as much a key element of machine understanding as it
is of human understanding. However, we admit that if the questions
are not relevant enough, this proactive behavior may cause
discontent on the user’s part. To learn more about users’
reactions to our system, we intend to conduct experiments
and surveys with many users.</p>
      <p>
        The relevance of the questions can be improved if topical category
labels are available for the documents. Therefore we plan to
prepare Docuphet to collaborate with document classifiers such as
the hitec3 framework [
        <xref ref-type="bibr" rid="ref12">16</xref>
        ].
      </p>
    </sec>
    <sec id="sec-15">
      <title>Acknowledgement</title>
      <p>Domonkos Tikk was supported by the Alexander von Humboldt
Foundation.</p>
      <p>[7] R. Corderoy. troff. www.troff.org/.
[8] DFKI Knowledge Management. Kaukolu. www.dfki.de/web/forschung/km/.
[9] The Docuphet project. www.docuphet.net.
[11] FCKEditor. www.fckeditor.net/.
[42] wikitext. en.wikipedia.org/wiki/Wikitext.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Collin F.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Charles J.</given-names>
            <surname>Fillmore</surname>
          </string-name>
          , and
          <string-name>
            <given-names>John B.</given-names>
            <surname>Lowe</surname>
          </string-name>
          .
          <article-title>The Berkeley FrameNet project</article-title>
          .
          <source>Proceedings of the 17th International Conference on Computational Linguistics</source>
          , pages
          <fpage>86</fpage>
          -
          <lpage>90</lpage>
          , Morristown, NJ, USA,
          <year>1998</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bechhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Goble</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Carr</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Kampa</surname>
          </string-name>
          . COHSE:
          <article-title>Semantic web gives a better deal for the whole web</article-title>
          ?
          <source>ISWC International Semantic Web Conference Poster</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bodain</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Robert</surname>
          </string-name>
          .
          <article-title>Developing a robust authoring annotation system for the semantic web</article-title>
          .
          <source>Proc. of 7th IEEE Int. Conf. on Advanced Learning Technologies</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ciravegna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dingli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wilks</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Petrelli</surname>
          </string-name>
          .
          <article-title>Amilcare: adaptive information extraction for document annotation</article-title>
          .
          <source>SIGIR'02: Proc. of the 25th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval</source>
          , pages
          <fpage>367</fpage>
          -
          <lpage>368</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          Citizendium. en.wikipedia.org/wiki/Citizendium.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] Context Discovery Inc.
          <article-title>Context Organizer for the Web</article-title>
          . http://www.contextdiscovery.com/context-organizer-for-the-web.aspx.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Domingue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dzbor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          .
          <article-title>Semantic layering with magpie</article-title>
          .
          <source>In Handbook on Ontologies</source>
          , pages
          <fpage>533</fpage>
          -
          <lpage>554</lpage>
          . Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [12]
          <article-title>The Friend Of A Friend project</article-title>
          .
          <source>www.foaf-project.org/ .</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [13] Free Software Foundation.
          <article-title>Texinfo</article-title>
          . www.gnu.org/software/texinfo/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [14] GRDDL Working Group.
          <article-title>Gleaning resource descriptions from dialects of languages</article-title>
          . www.w3.org/TR/grddl/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Handschuh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Ciravegna</surname>
          </string-name>
          .
          <article-title>S-cream-semi-automatic creation of metadata</article-title>
          .
          <source>Proc. of the European Conf. on Knowledge Acquisition and Management</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[16] HITEC. categorizer.tmit.bme.hu/trac/wiki.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [17]
          In4 Ltd.
          <article-title>The iGlue project</article-title>
          . http://iglue.com/beta/.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Á.</given-names>
            <surname>Kenyeres</surname>
          </string-name>
          .
          <article-title>Magyar Életrajzi Lexikon (Hungarian Biography Encyclopedia)</article-title>
          .
          <source>Arcanum Adatbázis Kft</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [19] KWARC.
          <article-title>SWiM: A semantic wiki for mathematical knowledge management</article-title>
          . kwarc.info/projects/swim/.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Vitaveska</given-names>
            <surname>Lanfranchi</surname>
          </string-name>
          , Fabio Ciravegna, Phil Moore, and
          <string-name>
            <given-names>Daniela</given-names>
            <surname>Petrelli</surname>
          </string-name>
          .
          <article-title>Document editing and browsing in AKTiveDoc</article-title>
          .
          <source>In DocEng '05: Proceedings of the 2005 ACM symposium on Document engineering</source>
          , pages
          <fpage>237</fpage>
          -
          <lpage>238</lpage>
          , New York, NY, USA,
          <year>2005</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ludwig</surname>
          </string-name>
          .
          <article-title>Artificial memory</article-title>
          .
          <source>www.artificialmemory.net/.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>L.</given-names>
            <surname>McDowell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Gribble</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Pentney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Verma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vlasseva</surname>
          </string-name>
          . Mangrove:
          <article-title>Enticing ordinary people onto the semantic web via instant gratification</article-title>
          .
          <source>Proc. of International Semantic Web Conference</source>
          , pages
          <fpage>754</fpage>
          -
          <lpage>770</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [23] Semantic MediaWiki.
          semantic-mediawiki.org/wiki/Semantic_MediaWiki.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [24]
          Nupedia. en.wikipedia.org/wiki/Nupedia.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [25] OASIS DITA Technical Committee.
          <article-title>Darwin Information Typing Architecture</article-title>
          . www.oasis-open.org/committees/dita.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [26] OASIS RELAX NG Committee.
          <article-title>RELAX NG</article-title>
          . www.oasis-open.org/committees/relax-ng.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Pereira</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Freire</surname>
          </string-name>
          .
          <article-title>SWedt: A semantic web editor integrating ontologies and semantic annotations with resource description framework</article-title>
          .
          <source>IEEE Int. Conf. on Internet and Web Applications and Services</source>
          , pages
          <fpage>200</fpage>
          -
          <lpage>200</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[28] Protégé. protege.stanford.edu/.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>V.</given-names>
            <surname>Quint</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Vatton</surname>
          </string-name>
          .
          <article-title>An introduction to Amaya</article-title>
          .
          <source>World Wide Web Journal</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [30]
          Salzburg Research.
          <article-title>IkeWiki</article-title>
          . ikewiki.salzburgresearch.at/.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [31]
          Szegedi Tudományegyetem, Nyelvtechnológiai Csoport (University of Szeged, Language Technology Group).
          <article-title>Szeged Korpusz 2</article-title>
          . www.inf.u-szeged.hu/projectdirs/hlt/.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [32] Advanced Knowledge Technologies.
          <article-title>Melita</article-title>
          . http://www.aktors.org/technologies/melita/.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tikk</surname>
          </string-name>
          .
          <article-title>Szövegbányászat (Text Mining), chapter 2</article-title>
          . TypoTEX,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [34]
          Tiny Moxiecode Content Editor (TinyMCE). tinymce.moxiecode.com/.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [35]
          TopQuadrant.
          <article-title>TopBraid</article-title>
          . www.topquadrant.com/topbraid/composer/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [36]
          VA Linux Systems.
          <article-title>PhpWiki</article-title>
          . phpwiki.sourceforge.net/.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Maria</given-names>
            <surname>Vargas-Vera</surname>
          </string-name>
          , Enrico Motta, John Domingue, Mattia Lanzoni, Arthur Stutt, and
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Ciravegna</surname>
          </string-name>
          . MnM:
          <article-title>Ontology driven semi-automatic and automatic support for semantic markup</article-title>
          .
          <source>In EKAW '02: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web</source>
          , pages
          <fpage>379</fpage>
          -
          <lpage>391</lpage>
          , London, UK,
          <year>2002</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [38]
          W3C CDF Working Group.
          <article-title>Compound Document Format</article-title>
          . www.w3.org/2004/CDF/.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [39]
          W3C Semantic Web Activity.
          <article-title>RDF/XML Syntax Specification</article-title>
          . www.w3.org/TR/rdf-syntax-grammar/.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [40]
          W3C Semantic Web Activity.
          <article-title>RDFa</article-title>
          . www.w3.org/TR/xhtml-rdfa-primer/.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wartena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brussee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gazendam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.-O.</given-names>
            <surname>Huijsen</surname>
          </string-name>
          .
          <article-title>A practical tool for semantic annotation</article-title>
          .
          <source>IEEE 18th Int. Conf. on Database and Expert Systems Applications (DEXA)</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>