=Paper= {{Paper |id=Vol-467/paper-3 |storemode=property |title=Docuphet – A Dialogue Assisted Content Annotation Tool |pdfUrl=https://ceur-ws.org/Vol-467/paper3.pdf |volume=Vol-467 }} ==Docuphet – A Dialogue Assisted Content Annotation Tool== https://ceur-ws.org/Vol-467/paper3.pdf
     Docuphet – a dialogue-assisted content annotation tool

                              Mihály Héder                                              Domonkos Tikk∗
               Budapest University of Technology and                              Institute for Computer Science
                           Economics                                               Humboldt University in Berlin
                  mihaly.heder@computer.org                                     tikk@informatik.hu-berlin.de



ABSTRACT
In this decade the amount of textual content stored on the web became enormous, but the basic structure of web documents remained unchanged: a mixture of text and markup. When creating a document, the user rarely has the possibility of embedding semantic annotation into the content, because editor applications do not have such a feature or are too difficult to use. We think that the creation of semantically rich documents can best be facilitated by a content editor with text mining technology running in the background. In this paper some of these technologies are brought into the spotlight.

* On leave from Dept. of Telecommunications and Media Informatics, Budapest University of Technology and Economics

1. INTRODUCTION
The success of the Wikipedia project illustrates the tremendous potential of the everyday web user for creating a vast amount of content. Other content-creation projects hold control over the submitted material by a rigorous review process or by a delegated editorial staff. These initiatives have never been able to produce the same quantity of content. The NuPedia project [24] that was started before Wikipedia is now accessible only in the Internet Archive. Citizendium [5], another controlled encyclopedia, has only some tens of thousands of articles, while Wikipedia has millions.

Of course the quality of these articles is a subject of debate. The content representation, however, is similar in the majority of cases. Almost every traditional encyclopedia-like content repository uses some form of page formatting markup and plain text. Some of them offer a limited vocabulary to categorize the article, give the date and the creator, and specify some keywords.

There are research projects aiming to capture not only the formatting but also the semantics of the text at content editing time. The majority of these applications define themselves as "semantic wikis". They enable the embedding of semantic annotation (usually RDF triples) by hand. That is, the user has to explicitly define the property and the value of the semantic expression involved, using a special markup language. This requires certain skills from the user and has a negative impact on the size of the potential audience.

Another approach is to annotate the text off-line, after the content has been created, without human supervision. To achieve this, natural language processing (NLP) technology, namely information extraction (IE) tools, are required. Considering the huge amount of textual data already present on the Web, this research is very important.

Undoubtedly, it would be fruitful to provide tools for the average user to ease the creation of semantic annotations when editing web content. The Docuphet [9] project aims at discovering and experimenting with possible solutions to the problem. In our view, web content creation should be a continuous, mutual dialogue between human and machine rather than a simple one-time input process.

This paper is organized as follows. In the following section we describe the motivations and goals behind Docuphet and the main components of the system. In Section 3 the technologies of the implemented components are detailed. Section 4 presents two example applications of the system. In Section 5 we discuss our experiences and propose some requirements on dialogue-assisted semantic annotators. Section 6 reviews related work in the field of semantic annotation software. Section 7 concludes the paper and discusses future work.

2. THE DOCUPHET PROJECT
The main goal of the Docuphet project is to create a system that enables the user to produce semantic annotation easily. This is achieved via an intuitive user interface that is supported by text processing algorithms in the background. We intend to capture the meaning of the currently edited content by addressing simple questions to the user. While the user types, the available text is processed by IE applets in the background. The extracted facts are then formulated as statements that are conveyed to the user in the form of closed-ended questions. If the user confirms a statement, the system automatically embeds the appropriate semantic annotation into the text. The annotator software does not require any special technical knowledge or skill from the user; it is only assumed that she knows the particular content that she is editing, and hence she is able to decide whether a related closed-ended question is true or not.
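The statement-confirmation loop described above can be sketched as follows. This is an illustrative sketch only: the actual Docuphet components are implemented in Java, and all names below are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    question: str      # closed-ended question conveyed to the user
    confidence: float  # real number of the unit interval [0, 1]
    annotation: str    # semantic annotation embedded on confirmation

def dialogue_round(suggestions, answer_fn, threshold=0.5):
    """Discard low-confidence suggestions, ask the rest, and collect
    the annotations whose statements the user confirmed."""
    accepted = []
    for s in suggestions:
        if s.confidence < threshold:   # configurable level (see Section 2.1)
            continue
        if answer_fn(s.question):      # user answers the pop-up question
            accepted.append(s.annotation)
    return accepted

suggestions = [
    Suggestion("Is this article about Pat Nixon?", 0.8, "person:pat_nixon"),
    Suggestion("Is this article about Nevada?", 0.3, "location:nevada"),
]
print(dialogue_round(suggestions, lambda q: True))
# -> ['person:pat_nixon']  (the 0.3 suggestion never reaches the user)
```

The point of the sketch is that the user only ever sees yes/no questions; the markup itself is handled entirely by the system.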

Another property of the system is that it stores text and annotation together. This integration has many advantages. For instance, when the text changes, no extra look-up operations are required to transfer the changes into the corresponding semantic annotations. If the annotations were stored separately, the maintenance of extra identifiers would be necessary to link text and annotation.
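A minimal sketch of the difference (illustrative markup only, not the actual DocBook/RDFa profile described later): when the annotation rides on the element that carries the text, an edit to the text leaves the annotation attached, with no identifier table to update.

```python
import xml.etree.ElementTree as ET

# Annotation stored inline, RDFa-style, as attributes of the paragraph
# element itself (illustrative markup, not the real DocBook/RDFa profile).
para = ET.fromstring(
    '<para property="bio:birthPlace" content="Nevada">'
    "Pat Nixon was born in Nevada in 1912.</para>"
)

para.text = "Pat Nixon was born in Nevada."  # the text is edited...
# ...yet the annotation is still attached, with no look-up table involved:
print(para.get("property"), para.get("content"))
# -> bio:birthPlace Nevada
```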

To make it accessible and runnable without installation, the client
part of the system runs in an ordinary web browser.

It is a natural requirement for every system to support multiple languages. This requirement was considered when developing every component of the system. However, many IE techniques are language specific.1 For the first applications, Hungarian was selected as the primary language.
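As an illustration of this language dependence (an invented rule, not part of JNER): a date pattern written for Hungarian, where dates appear as "1912. március 16.", matches nothing in English text.

```python
import re

# Hungarian month names; Hungarian dates are written year-first with a
# lower-case month name, e.g. "1912. március 16." (illustrative rule only).
HU_MONTHS = ("január|február|március|április|május|június|"
             "július|augusztus|szeptember|október|november|december")
HU_DATE = re.compile(r"\b\d{4}\.\s*(?:" + HU_MONTHS + r")\s+\d{1,2}\.")

print(bool(HU_DATE.search("Budapesten született 1912. március 16.")))  # True
print(bool(HU_DATE.search("She was born on March 16, 1912.")))         # False
```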

Docuphet focuses only on the creation of annotations. Further use of the semantically annotated content, such as semantic search and retrieval or machine reasoning, is currently out of the scope of the project. We devoted our efforts to the design and validation of our concept; therefore some components of the prototype applications are not yet optimized for efficiency under heavy load.

2.1 The main components of Docuphet
The user interacts with the Docuphet Content Editor's (DCE) web interface. The web application forwards the created content to the server via AJAX calls. On the server the content is distributed to various IE modules. The modules generate annotation suggestions. Each suggestion comprises a textual statement or question, a numeric confidence level, one or more possible answers to the question, and semantic annotations (one per positive answer). The confidence level is a real number of the unit interval, which specifies the validity of the suggestion. Under a certain (configurable) level, suggestions are automatically disregarded. The modules' suggestions are collected on the server and are sent back to the client, where they are presented to the user in the form of pop-up closed-ended questions. If the user confirms a statement, the system inserts the corresponding annotation into the content. When the user finishes editing, the full annotated document is sent back to the server and saved there. Figure 1 provides an overview of the system.

[Figure 1: The overview of the components and the data flow of the Docuphet system.]

In the next section we discuss the components of Docuphet: the content editor, the annotation storage, and the text processing and information extraction applets.

3. TECHNOLOGY OVERVIEW
3.1 The content editor
There is an abundance of tools for creating content on a computer. These can be categorized in many ways. One aspect is the mode of editing. There are WYSiWYG editors like desktop word processors. Other, structured editors have two views: one for editing and one for viewing the document. In structured editors, the user handles objects like sections, titles, and paragraphs. This category includes wiki editors, publishing tools, DocBook editors, and scientific editors for LaTeX. In these applications the final formatting is done with style class files or style sheets.

Another aspect is the technology of the editor. There are two main categories: the more function-rich desktop applications that need to be installed on the client, and the web-based editors [11, 34] that require only a web browser to run. Web-based editors were formerly simpler, but as AJAX became widespread, the complexity of such applications became almost equal to that of desktop ones.

The special requirements on DCE led to the development of a completely new solution. The DCE is an easy-to-use WYSiWYN2 editor, which is capable of editing the structure of a document without the need to learn a markup language. DCE also handles the communication with the server, the representation of suggestions, and the integration of annotations (see also Figures 4–5 for screenshots).

1 In fact, it is one of the central problems of text processing to provide language-independent information extraction methods.
2 "What You See is What You Need" editors let the user edit the structure of the document but not the source markup directly. They differ from WYSiWYG editors in that further transformations are applied to generate the final view of the document.

3.2 Content storage
For selecting the best of the plethora of content storage formats we set up the following requirements:

1. simplicity, to allow the implementation as a web editor;
2. standard, stable, and free to use;
3. proper support (documentation, examples, templates, editors, tools);
4. extensibility to carry annotations;
5. support for the following formatting: paragraphs, sections and titles; lists, program listings, emphases, images, tables, links.

Many options were evaluated: RTF, texinfo [13], troff [7], wikitext [42], XHTML, DocBook, DITA [25], LaTeX, ODF, and CDF [38].

Because of the large variety of convenient XML processing tools we dropped the non-XML formats. We also dropped ODF because of its high complexity. DITA and CDF (WICD) are too specific for our purposes. From the remaining two candidates we decided
in favour of DocBook, because this format is purely structural: it doesn't contain any markup relating to document formatting, and its grammar is defined in an easy-to-subtype Relax NG [26] format.

3.3 Storing semantic annotations in the content
Many possible technologies were evaluated for storing semantic annotations in a DocBook document: HTML metadata, RDF/XML [39], GRDDL [14], Microformats, and RDFa [40]. Finally we selected RDFa because of its many advantages.

RDFa [40] has been developed by the W3C and is by now a W3C recommendation. RDFa offers a technique that transforms an arbitrary part of an XML document into an RDF triple. The technology is primarily aimed at annotating XHTML documents, but it is also capable of handling XML documents from other namespaces (an example is depicted in Figure 2).

[Figure 2: Excerpt of an RDFa-annotated DocBook document]

RDFa was also favoured because it is easy to integrate with other XML namespaces, like DocBook. RDFa allows annotating every part of a document, while it is still relatively easy to retrieve the RDF triples from the XML. To conform DocBook with RDFa, we extended the DocBook RNG schema to carry RDFa annotations, and we termed this DocBook profile DocBook/RDFa.

3.4 Extracting semantic information from the text
3.4.1 Named Entity Recognition
A named entity (NE) is a natural textual identifier of an object, such as person names, names of companies, locations, names of products, addresses, telephone numbers, and email addresses. Named entity recognition (NER) is an NLP task which aims at identifying and classifying NEs in the text.

The difficulty of the recognition task depends on the type of NEs. Telephone numbers and e-mail addresses can be easily recognized with simple regular expressions. The recognition of personal names and locations is more difficult, but it can be effectively supported with appropriate vocabularies. The recognition of company or product names can be much harder, because essentially no constraint applies to their surface form. NER is primarily performed by analysing some features of the candidate NEs, e.g. surface clues (capitalization, numbers, special symbols), the frequency, and the in-sentence, in-paragraph, and in-document positions, whereas grammatical and morphological analysis may also be applied.

The Docuphet framework contains a general purpose named entity recognizer, JNER, implemented in Java. This component comprises several modules, each analysing the text with a different technique. Many of them use vocabularies, e.g. for given names, company suffixes, or locations. Others are based on regular expressions. It is possible to plug in external tools, such as a stemmer or a morphological parser. Other external tools can serve as connectors to databases (e.g. IMDb, DMOZ) or to search engines (Google, Wikia Search).

Although usually not considered as NEs, the recognition of professions (like painter, composer, engineer) and human properties (blond, tall) is also supported in JNER.

Conforming with the philosophy of Docuphet (ask relevant questions from the user), the NER can also be assisted by asking closed-ended questions from the user.

3.4.2 Information Frame Recognition
An information frame (IF) is a triple in which at least one value is missing and is thus substituted by a variable name. The class of the missing component(s) may be known.

Information frame recognition (IFR) means the recognition of instances of an IF in the text by identifying the missing components of the triple. For instance, in the sentence "Pat Nixon was born in Nevada in 1912", we can recognize an instance of such an IF.

IFs can be defined in various ways. One method is to derive them from the "semantic frames" used in the Berkeley FrameNet project [1] (see the example below). The aim of this project is to create an annotated lexical resource for English, using frame semantics and supported by corpus evidence. These frames refer to pre-defined conceptual structures:

    frame(DESIGN),
      inherit(CREATE),
      frame_elements(DESIGNER(=CREATOR), BUILDING(=WORK)),
      scenes(DESIGNER designs BUILDING)

The frame elements refer to certain semantic roles of actors and objects present in the scene. In the FrameNet project, human annotators mark the occurrences of the frame elements in texts.

Evidently, to find the possible semantic role of a text element, it is very helpful to have the named entities and their types identified beforehand. In some cases additional rules may also be useful, e.g. specifying constraints on the relative position of the elements of an IF (in-sentence, in-paragraph). While limiting the number of potentially identifiable IFs, this constraint has many practical advantages: it enables starting the IFR process before the whole content is available, and it significantly narrows the search space, thus reducing the computation time.

JFrame is the IFR component of the Docuphet framework, written in Java. IFs and the corresponding recognition rules can be defined in JFrame. The input of the module is a token stream in which the recognized NEs and their types are already marked. A JFrame IF definition may contain rules related to the class and lemmas of the NEs in the token stream, and may have conditions on their order. JFrame also provides a confidence level for every recognized IF instance, in accordance with Docuphet's workflow.

Currently in Docuphet, the IFR rules are defined by hand after experimentation.
This is necessary because:

• There isn't enough properly annotated text which can be used as training data.
• The annotation method in Docuphet is based on dialogues. Therefore, for each IF a simple function must be defined which generates the corresponding closed-ended questions.

3.4.3 Sentence segmentation
To support the recognition of IFs, a sentence segmenter named JSentence was developed. The component implements a modified version of the algorithm described in [33]. The Hungarian configuration of the tool was tested on the Szeged2 corpus [31], the biggest multi-thematic text corpus in Hungarian. The corpus contains 82096 sentences and consists of complete novels from various authors and genres, high school essays, general newspaper articles, legal texts, computer-related handbooks, and economic news. JSentence recognized sentence boundaries with a precision of 99.06% at FP ≈ FN. JSentence uses a rule-based algorithm with 15 regular-expression-based rules and 4 abbreviation lists.

[Figure 3: A typical workflow of Docuphet]

4. APPLICATION EXAMPLES
In this section two demo applications of Docuphet are presented. BioBase is a web site for collecting biographies of known people (similar to the ones in [18]), user autobiographies, or simple self-introductory texts. FlatBase is a real estate advertisement portal. In the case of the biographies, Docuphet is configured to recognize IFs based only on the entered text. For the FlatBase portal, the entered text is analysed first, and when some relevant information is still missing (e.g. the floor number) Docuphet automatically asks questions from the user to complete the advertisement database properly. Both applications are configured to work on Hungarian text.

The workflow is the same in both scenarios, as described in Section 2.1. Both applications use DCE and the same document server component versions. They differ in the way the annotations are produced, thus we discuss this part in detail next.

4.1 BioBase
In BioBase a two-layered IF recognition has been implemented. In the first layer NEs are identified as follows:

1. JNER analyses the text, configured with all the NER rules available for the Hungarian language. Currently this includes person name recognition (male and female distinguished) based on regular expressions and a given-name vocabulary; location, nationality, profession, education, residence, family status, and social relation (friends/other related people) recognition based on vocabularies; email and phone number recognition based on regular expressions; and date recognition based on a custom Java component, regular expressions, and a list of month names.

2. JNER assigns a confidence level to every recognized NE, which is calculated by summing the pre-defined values of the matching rules.3 Over confidence level 0.95, an annotation suggestion is created with the following RDF triple:

    … <[NE value]> .

where the NE type and NE value are substituted with the actual values. The annotation suggestions also have a level of confidence4 that is set to 0.95. DCE accepts every annotation above 0.9 without asking a confirming question from the user, in order to avoid flooding the user with questions.5 The target of the annotations (the location where the annotation is placed) is the node before the given paragraph. From these annotations, a typed tag cloud can be generated (see also Figure 4).

3. For the top three person and location NEs with a confidence level between 0.8 and 0.95, an annotation suggestion of value 0.8 is created with the question "This article is related to the person/location (value)", with the same target and RDF triple as in the previous step.

3 The creator of a rule can set both positive and negative confidence coefficients.
4 This is based on the NE confidence level, but can be weighted, e.g. when a candidate person NE shares the surname with an already recognized NE.
5 Every annotation can be easily removed by deletion or with the undo action.

Based on the NEs recognized, the following IE attempts are made:

1. First, the person whom the article is about is identified. This is done by creating suggestions on the first person NE found (in the title or in the text) with the triple

    … <[NE value]> .

where the target is the beginning of the article and the confidence level is 0.8. This is repeated with the subsequent person NEs until a positive answer is given by the user. (In our experience the article is nearly always about the first mentioned person.)

2. If the subject of the article is known, the date and place of birth and death are attempted to be identified next. This is done by analysing the date and location tokens within a centered window of 15 tokens around the occurrences of the identified person. Extra confidence is added if a past tense verb form, "született" (was born) or "elhunyt | meghalt" (died), appears around a particular date or location (these are, however, often omitted).

3. Nationality, profession, education, residence, friends, and the parents' names are identified in a similar way as described in the previous point, involving certain terms (his mother | father) where appropriate.

Using these annotations the persons can be categorized by the era in which they lived or live, by profession, nationality, or location. All the RDF properties used in the process are in BioBase's namespace, but if required, they can easily be mapped into other namespaces, like FOAF [12].

Our experience with BioBase shows that Docuphet is able to capture the basic biographic information about a person. When creating a 500-word-long biography (see Figure 4), the system pops up 10–15, mostly adequate, questions. One direction of further work could be the addition of new event IFs based on a review of biographies.

[Figure 4: A screenshot from BioBase]

4.2 FlatBase
IFR is performed in FlatBase also in two layers, but in a completely different way than in BioBase. On the first level, the main information of the advertisement is attempted to be identified:

• type of property: house, condominium, apartment, etc.;
• type of offer: for rent, for sale;
• location: countryside, city;
• price range;
• size;
• contact information.

When a part of the main information becomes available, the details are attempted to be extracted. Some examples:

• details about the location (district or quarter, street name);
• building materials;
• in the case of flats: on which particular floor the property is; whether it is the top/ground floor;
• for certain types of property: orientation (street, garden);
• for a house: size of the garden;
• type of heating (depends on the property type);
• arrangement of rooms (e.g. separate entrance);
• public transport facilities (different types, based on the city).

On both levels, specially configured JNER instances are used. The configuration initially includes a shorter list of locations and regular-expression-based rules to recognize price, size, and contact information. IFR is then performed based on the already known NEs and certain trigger words related to building materials, heating types, etc. When some information is extracted, the JNERs may be reconfigured accordingly (e.g. loading the corresponding quarters when a district of Budapest is found).

When some of the main information is missing from the ad at saving time, the user is asked to complete it.6 If this happens again, a limited number of direct questions are formulated related to the missing information, every question only once. At the end, the ad is saved even if some information is still missing.

6 But the actual content is stored anyway.

FlatBase has proven to be very effective because flat advertisements are usually short, very similar to each other, and their vocabulary is limited. As a result, FlatBase is capable of extracting nearly all (usually not more than 5–10, see Figure 5) important pieces of information from an advertisement. This suggests that FlatBase may become a user-friendly input-assisting tool for advertisement portals.

[Figure 5: A screenshot from FlatBase]

5. DISCUSSION
Bodain and Robert composed five requirements on the static properties of semantic annotations [3]:

• robust anchors,
• transparency,
• freedom in choosing the semantic vocabulary,
• variable granularity,
• handling dynamic updates.

For dialogue-assisted annotators, such as Docuphet, most of the above requirements can be carried over7, but as an outcome of our experimentation, we can now formulate additional requirements on semantic annotation creation:

1. A particular question must never be asked twice, to avoid the discontent of the user. This implies that:
2. Every suggestion must be stored in the document, regardless of the answer, if any. Per-session storage is not sufficient since the document can be edited in several sessions. Further research is needed to find out whether suggestions should be stored on a per-document or a per-user basis.
3. There are two main types of suggestions depending on the positive/negative answer of the user. As future work it should be investigated how negative answers can be exploited in annotation.
4. According to the "handling dynamic updates" rule described in [3], the annotations must be re-validated upon every text change. Given our point 1, this requirement can hardly be met. Theoretically it is possible to insert a statement into the first part of a document which negates the meaning of everything in a given scope. To handle this appropriately, we should either fully capture the meaning of the change and update the right annotations, or re-ask every question. The first solution is not yet possible to carry out; the latter causes a high number of undesirable questions. To get around this problem, we devised two techniques:

   • Limited scope: We have defined two types of annotations: basic and derived. Basic annotations regard only a specific text scope (a title, a paragraph, a list item). We assume that changes outside the scope do not affect them. To fulfil this assumption, basic annotations are very simple, e.g. "This paragraph is related to Budapest" when the token Budapest is present. These annotations are usually retrieved by NER as described in Section 4.1, and re-validated upon text change in their scope.

   • Dependencies: We define a dependency graph of annotations. Derived annotations depend on basic or other derived annotations. The dependency is tracked with a list of annotation IDs. Every time an annotation changes, its dependants should be re-validated. If an annotation is deleted, the derived annotations must be deleted as well. Furthermore, a derived annotation may require re-validation when non-annotated parts of the text change, since some derived annotations may depend on the characteristics of the text or on missing annotations. For example, if the user accepts a new "main category" annotation, the old one must be re-validated or simply deleted.

5. The number of questions asked together has to be limited. This is important because if the user pastes in a larger piece of text, then many suggestions may be generated. These must be asked in several turns.

6. RELATED WORK
In the last 15 years, a large number of semantic annotation tools has been developed. Here we recall and compare the most important ones (see also Table 1).

We can divide the semantic annotators into two main groups:

• Semantic wikis: One form of inserting semantic annotations into documents is via semantic wikis, such as Semantic MediaWiki [23], Artificial Memory [21], Kaukolu [8], PHPWiki [36], IkeWiki [30], and SWiM [19]. These applications enable the user to input RDF data by using a special syntax. The available semantic vocabulary and the granularity of the annotations vary in these applications, but in all cases semantic handling skill is required from the user.

• Desktop ontology builders and annotators: This group contains some feature-rich desktop annotators for authoring semantically annotated documents. Protégé [28], TopBraid [35], Amaya [29], and Mangrove [22] are frameworks for building ontologies and knowledge graphs. SWEDT [27], Apolda [41], and KATIA [3] have rich document editing and annotating capabilities. These are professional tools for knowledge experts.

S-CREAM [15] integrates the Amilcare [4] IE module that implements a semi-supervised machine learning method: a set of training data must be annotated by hand in advance, then Amilcare creates certain annotations automatically in new documents. COHSE [2] highlights text and provides additional information for strings matching elements of a pre-defined knowledge base. Magpie [10] allows annotation of a pre-defined set of concepts based on forms.

Docuphet has much in common with SWEDT [27] and KATIA [3] in the visualization of annotations. But unlike these tools, Docuphet's editor hides the annotation markup details from the user and provides annotation visualization instead.

Like Melita [32], AKTiveDoc [20], MnM [37], or S-CREAM [15], Docuphet also uses IE technology to extract semantic annotation candidates from the text. However, the way Docuphet uses IE is quite dissimilar from these tools, since it uses IFs as a common concept of semantic information and involves the user in the process.

Like COHSE and Magpie, Docuphet is based on pre-defined concepts and relations, which are termed information frames. The targeted audience of unskilled users and the NLP- and IE-based dialogue-assisted annotation creation render our solution rather unique. Our approach is comparable to the question-sequence-based guidance provided by some complex installation wizards, where the questions are based on the information gathered earlier. In Docuphet the set of IFR elements represents the knowledge base, which provides a sophisticated, flexible, and order-independent solution.

7. CONCLUSION AND FUTURE WORK
Docuphet is a dialogue-assisted semantic text annotator. The computer–human dialogue is facilitated by IE techniques: named entity recognition and information frame extraction.
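As a recap of the mechanism described in Section 3.4.2, the IF-instantiation step can be sketched like this (hypothetical names and rule, not JFrame's actual API): the token stream arrives with the NE types already marked, a trigger word selects the IF, and the filled triple is paired with a closed-ended question and a confidence level.

```python
# Token stream with NE types already marked by the NER stage.
tokens = [("Pat Nixon", "PERSON"), ("Nevadában", "LOCATION"),
          ("született", "VERB")]  # "született" = "was born" (Hungarian)

def recognize_birth_if(tokens):
    """Hypothetical IF rule: ?person bio:birthPlace ?place,
    triggered by the verb 'született'."""
    if not any(word == "született" for word, _ in tokens):
        return None  # trigger word absent, IF not instantiated
    person = next((w for w, t in tokens if t == "PERSON"), None)
    place = next((w for w, t in tokens if t == "LOCATION"), None)
    if person is None or place is None:
        return None  # a component of the triple is still missing
    return {
        "triple": (person, "bio:birthPlace", place),
        "question": f"Was {person} born in {place}?",  # closed-ended
        "confidence": 0.8,
    }

print(recognize_birth_if(tokens)["triple"])
# -> ('Pat Nixon', 'bio:birthPlace', 'Nevadában')
```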
Table 1: Comparing Docuphet to other solutions

| Name | Platform | Editor type | Storage format | Semantic annotation input | Semantic vocabulary | Way of storage | Skills required |
| WikiPedia | Web | textarea | wikitext | – | – | – | – |
| Semantic MediaWiki | Web | textarea | wikitext | markup | arbitrary | special wikitext | ontologies, markup |
| SWEDT | Eclipse | source code editor | HTML | form + RDF source | arbitrary | RDF | Web development, ontologies |
| Katia | desktop Java | word processor | HTML | drag-and-drop | predefined | RDF server | ontologies |
| Amaya | desktop application | source code editor | HTML | forms | arbitrary | RDF server | Web development, ontologies, RDF |
| Melita | desktop application | word processor | HTML | forms + named entity suggestions (a) | predefined | RDF database | ontologies |
| Docuphet | Web | What You See is What You Need | Docbook | suggestions in natural language | predefined | RDFa | – |

(a) Melita collects all named entities as potential instances of ontology classes and builds the corresponding ontology simultaneously.

In Docuphet, unlike in some semantic wikis, it is not possible to annotate the text with an arbitrary vocabulary: because of the nature of the system, we applied a pre-defined semantic vocabulary. Therefore Docuphet is only capable of handling pre-defined RDF information triples, which limits the flexibility of the system. On the other hand, this very property allows us to compose easy-to-understand questions about the known triples, as the questions are defined together with the corresponding IFs. This way it is easy to create annotations even for completely uninitiated users.

Given these properties, Docuphet is most useful when the domain of the text is known in advance. Two exemplary applications were presented in Section 4. Other possible applications include the annotation of economic or sport news, product reviews or geolocation reviews. In these cases the set of appropriate IFs and the corresponding IFR rules have to be created in advance.

Despite these limitations, basic functionality is available without specific domain knowledge. Docuphet is capable of recognizing NEs in an arbitrary text and formulating questions about the NE candidates. This makes it a very useful tool for building NE databases and for disambiguation applications. Another possible application area is the assistance of context-sensitive browser tools, such as In4's iGlue [17] and Context Discovery Inc.'s Context Organizer [6].

As for future work, we intend to enable Docuphet to access and edit Wikipedia articles via the interface provided by MediaWiki's public API. As Wikipedia uses the wikitext format, which is very different from Docbook/RDFa, the most problematic tasks are apparently the conversion of the articles and the placement of the annotations in wikitext. We also plan to integrate Docuphet with large public databases like IMDB, to facilitate disambiguation and named entity recognition.

We think that in machine understanding, bidirectional communication (questions and answers) is a key element, just like in human understanding. However, we admit that if the questions are not relevant enough, this proactive behavior probably causes discontent on the user's part. To find out more about users' reactions when using our system, we intend to conduct experiments and surveys with many users.

The relevance of the questions can be improved if topical category labels are available for the documents. Therefore we plan to prepare Docuphet to collaborate with document classifiers, such as the hitec3 framework [16].

Acknowledgement
Domonkos Tikk was supported by the Alexander von Humboldt Foundation.

8. REFERENCES
[1] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics, pages 86–90, Morristown, NJ, USA, 1998. Association for Computational Linguistics.
[2] S. Bechhofer, C. Goble, L. Carr, and S. Kampa. COHSE: Semantic web gives a better deal for the whole web? ISWC International Semantic Web Conference Poster, 2002.
[3] Y. Bodain and J.-M. Robert. Developing a robust authoring annotation system for the semantic web. Proc. of 7th IEEE Int. Conf. on Advanced Learning Technologies, 2007.
[4] F. Ciravegna, A. Dingli, Y. Wilks, and D. Petrelli. Amilcare: adaptive information extraction for document annotation. SIGIR'02: Proc. of the 25th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 367–368, 2002.
[5] Citizendium. en.wikipedia.org/wiki/Citizendium.
[6] Context Discovery Inc. Context Organizer for the Web. http://www.contextdiscovery.com/context-organizer-for-the-web.aspx.
[7] R. Corderoy. troff. www.troff.org/.
[8] DFKI Knowledge Management. Kaukolu. www.dfki.de/web/forschung/km/.
[9] The Docuphet project. www.docuphet.net.
[10] J. Domingue, M. Dzbor, and E. Motta. Semantic layering with Magpie. In Handbook on Ontologies, pages 533–554. Springer, 2004.
[11] FCKeditor. www.fckeditor.net/.
[12] The Friend Of A Friend project. www.foaf-project.org/.
[13] Free Software Foundation. Texinfo. www.gnu.org/software/texinfo/.
[14] GRDDL Working Group. Gleaning resource descriptions from dialects of languages. www.w3.org/TR/grddl/.
[15] S. Handschuh, S. Staab, and F. Ciravegna. S-CREAM: semi-automatic creation of metadata. Proc. of the European Conf. on Knowledge Acquisition and Management, 2002.
[16] HITEC. categorizer.tmit.bme.hu/trac/wiki.
[17] In4 Ltd. The iGlue project. http://iglue.com/beta/.
[18] Á. Kenyeres. Magyar Életrajzi Lexikon (Hungarian Biography Encyclopedia). Arcanum Adatbázis Kft, 1994.
[19] KWARC. SWiM: A semantic wiki for mathematical knowledge management. kwarc.info/projects/swim/.
[20] Vitaveska Lanfranchi, Fabio Ciravegna, Phil Moore, and Daniela Petrelli. Document editing and browsing in AKTiveDoc.
In DocEng '05: Proceedings of the 2005 ACM Symposium on Document Engineering, pages 237–238, New York, NY, USA, 2005. ACM.
[21] L. Ludwig. Artificial memory. www.artificialmemory.net/.
[22] L. McDowell, O. Etzioni, S. D. Gribble, A. Y. Halevy, H. M. Levy, W. Pentney, D. Verma, and S. Vlasseva. Mangrove: Enticing ordinary people onto the semantic web via instant gratification. Proc. of the International Semantic Web Conference, pages 754–770, 2003.
[23] Semantic MediaWiki. semantic-mediawiki.org/wiki/Semantic_MediaWiki.
[24] Nupedia. en.wikipedia.org/wiki/Nupedia.
[25] OASIS DITA Technical Committee. Darwin Information Typing Architecture. www.oasis-open.org/committees/dita.
[26] OASIS RELAX NG Committee. RELAX NG. www.oasis-open.org/committees/relax-ng.
[27] R. G. Pereira and M. M. Freire. SWedt: A semantic web editor integrating ontologies and semantic annotations with Resource Description Framework. IEEE Int. Conf. on Internet and Web Applications and Services, pages 200–200, 2006.
[28] Protégé. protege.stanford.edu/.
[29] V. Quint and I. Vatton. An introduction to Amaya. World Wide Web J., 1997.
[30] Salzburg Research. IkeWiki. ikewiki.salzburgresearch.at/.
[31] Szegedi Tudományegyetem Nyelvtechnológiai Csoport (University of Szeged, Language Technology Group). Szeged korpusz 2 (Szeged Corpus 2). www.inf.u-szeged.hu/projectdirs/hlt/.
[32] Advanced Knowledge Technologies. Melita. http://www.aktors.org/technologies/melita/.
[33] D. Tikk. Szövegbányászat (Text Mining), chapter 2. TypoTEX, 2007.
[34] Tiny Moxiecode Content Editor (TinyMCE). tinymce.moxiecode.com/.
[35] TopQuadrant. TopBraid. www.topquadrant.com/topbraid/composer/index.html.
[36] VA Linux Systems. PhpWiki. phpwiki.sourceforge.net/.
[37] Maria Vargas-Vera, Enrico Motta, John Domingue, Mattia Lanzoni, Arthur Stutt, and Fabio Ciravegna. MnM: Ontology driven semi-automatic and automatic support for semantic markup. In EKAW '02: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web, pages 379–391, London, UK, 2002. Springer-Verlag.
[38] W3C CDF Working Group. Compound Document Format. www.w3.org/2004/CDF/.
[39] W3C Semantic Web Activity. RDF/XML Syntax Specification. www.w3.org/TR/rdf-syntax-grammar/.
[40] W3C Semantic Web Activity. RDFa. www.w3.org/TR/xhtml-rdfa-primer/.
[41] C. Wartena, R. Brussee, L. Gazendam, and W.-O. Huijsen. A practical tool for semantic annotation. IEEE 18th Int. Conf. on Database and Expert Systems Applications (DEXA), 2007.
[42] wikitext. en.wikipedia.org/wiki/Wikitext.