=Paper= {{Paper |id=Vol-467/paper-3 |storemode=property |title=Docuphet – A Dialogue Assisted Content Annotation Tool |pdfUrl=https://ceur-ws.org/Vol-467/paper3.pdf |volume=Vol-467 }} ==Docuphet – A Dialogue Assisted Content Annotation Tool== https://ceur-ws.org/Vol-467/paper3.pdf
     Docuphet – a dialogue-assisted content annotation tool

                              Mihály Héder                                              Domonkos Tikk∗
               Budapest University of Technology and                              Institute for Computer Science
                           Economics                                               Humboldt University in Berlin
                  mihaly.heder@computer.org                                     tikk@informatik.hu-berlin.de



ABSTRACT
In this decade the amount of textual content stored on the web became enormous, but the basic structure of web documents remained unchanged: a mixture of text and markup. When creating a document, the user rarely has the possibility of embedding semantic annotation into the content, because editor applications do not have such a feature or are too difficult to use. We think that the creation of semantically rich documents can best be facilitated by a content editor with text mining technology running in the background. In this paper some of these technologies are brought into the spotlight.

* On leave from Dept. of Telecommunications and Media Informatics, Budapest University of Technology and Economics

1. INTRODUCTION
The success of the Wikipedia project illustrates the tremendous potential of the everyday web user for creating a vast amount of content. Other content-creation projects hold control over the submitted material by a rigorous review process or by a delegated editorial staff. These initiatives have never been able to produce the same quantity of content. The NuPedia project [24] that was started before Wikipedia is now accessible only in the Internet Archive. Citizendium [5], another controlled encyclopedia, has only some tens of thousands of articles, while Wikipedia has millions.

Of course the quality of these articles is a subject of debate. The content representation, however, is similar in the majority of cases. Almost every traditional encyclopedia-like content repository uses some form of page formatting markup and plain text. Some of them offer a limited vocabulary to categorize the article, give the date and the creator, and specify some keywords.

There are research projects aiming to capture not only the formatting but also the semantics of the text at content editing time. The majority of these applications define themselves as "semantic wikis". They enable the embedding of semantic annotation (usually RDF triples) by hand. That is, the user has to explicitly define the property and the value of the semantic expression involved, using a special markup language. This requires certain skills from the user and has a negative impact on the size of the potential audience.

Another approach is to annotate the text off-line, after the content has been created, without human supervision. To achieve this, natural language processing (NLP) technology, namely information extraction (IE) tools, are required. Considering the huge amount of textual data already present on the Web, this research is very important.

Undoubtedly, it would be fruitful to provide tools for the average user to ease the creation of semantic annotations when editing web content. The Docuphet [9] project aims at discovering and experimenting with possible solutions to the problem. In our view, web content creation should be a continuous, mutual dialogue between human and machine rather than a simple one-time input process.

This paper is organized as follows. In the following section we describe the motivations and goals behind Docuphet and the main components of the system. In Section 3 the technologies of the implemented components are detailed. Section 4 presents two example applications of the system. In Section 5 we discuss our experiences and propose some requirements on dialogue-assisted semantic annotators. Section 6 reviews related work in the field of semantic annotation software. Section 7 concludes the paper and discusses future work.

2. THE DOCUPHET PROJECT
The main goal of the Docuphet project is to create a system that enables the user to produce semantic annotation easily. This is achieved via an intuitive user interface that is supported by text processing algorithms in the background. We intend to capture the meaning of the currently edited content by addressing simple questions to the user. While the user types, the available text is processed by IE applets in the background. The extracted facts are then formulated as statements that are conveyed to the user in the form of closed-ended questions. If the user confirms a statement, the system automatically embeds the appropriate semantic annotation into the text. The annotator software does not require any special technical knowledge or skill from the user; it is only assumed that she knows the particular content that she is editing, and hence she is able to decide whether a related closed-ended question is true or not.
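The statement-confirmation loop described above can be sketched as follows. This is an illustrative sketch only: the actual Docuphet components are implemented in Java, and all names below are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    question: str      # closed-ended question conveyed to the user
    confidence: float  # real number of the unit interval [0, 1]
    annotation: str    # semantic annotation embedded on confirmation

def dialogue_round(suggestions, answer_fn, threshold=0.5):
    """Discard low-confidence suggestions, ask the rest, and collect
    the annotations whose statements the user confirmed."""
    accepted = []
    for s in suggestions:
        if s.confidence < threshold:   # configurable level (see Section 2.1)
            continue
        if answer_fn(s.question):      # user answers the pop-up question
            accepted.append(s.annotation)
    return accepted

suggestions = [
    Suggestion("Is this article about Pat Nixon?", 0.8, "person:pat_nixon"),
    Suggestion("Is this article about Nevada?", 0.3, "location:nevada"),
]
print(dialogue_round(suggestions, lambda q: True))
# -> ['person:pat_nixon']  (the 0.3 suggestion never reaches the user)
```

The point of the sketch is that the user only ever sees yes/no questions; the markup itself is handled entirely by the system.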

Another property of the system is that it stores text and annotation together. This integration has many advantages. For instance, when the text changes, no extra look-up operations are required to transfer the changes into the corresponding semantic annotations. If the annotations were stored separately, the maintenance of extra identifiers would be necessary to link text and annotation.
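A minimal sketch of the difference (illustrative markup only, not the actual DocBook/RDFa profile described later): when the annotation rides on the element that carries the text, an edit to the text leaves the annotation attached, with no identifier table to update.

```python
import xml.etree.ElementTree as ET

# Annotation stored inline, RDFa-style, as attributes of the paragraph
# element itself (illustrative markup, not the real DocBook/RDFa profile).
para = ET.fromstring(
    '<para property="bio:birthPlace" content="Nevada">'
    "Pat Nixon was born in Nevada in 1912.</para>"
)

para.text = "Pat Nixon was born in Nevada."  # the text is edited...
# ...yet the annotation is still attached, with no look-up table involved:
print(para.get("property"), para.get("content"))
# -> bio:birthPlace Nevada
```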

To make it accessible and runnable without installation, the client
part of the system runs in an ordinary web browser.

It is a natural requirement for every system to support multiple languages. This requirement was considered when developing every component of the system. However, many IE techniques are language specific.1 For the first applications, Hungarian was selected as the primary language.
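As an illustration of this language dependence (an invented rule, not part of JNER): a date pattern written for Hungarian, where dates appear as "1912. március 16.", matches nothing in English text.

```python
import re

# Hungarian month names; Hungarian dates are written year-first with a
# lower-case month name, e.g. "1912. március 16." (illustrative rule only).
HU_MONTHS = ("január|február|március|április|május|június|"
             "július|augusztus|szeptember|október|november|december")
HU_DATE = re.compile(r"\b\d{4}\.\s*(?:" + HU_MONTHS + r")\s+\d{1,2}\.")

print(bool(HU_DATE.search("Budapesten született 1912. március 16.")))  # True
print(bool(HU_DATE.search("She was born on March 16, 1912.")))         # False
```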

Docuphet focuses only on the creation of annotations. Further use of the semantically annotated content, such as semantic search and retrieval or machine reasoning, is currently out of the scope of the project. We devoted our efforts to the design and validation of our concept; therefore some components of the prototype applications are not yet optimized for efficiency under heavy load.

2.1 The main components of Docuphet
The user interacts with the Docuphet Content Editor's (DCE) web interface. The web application forwards the created content to the server via AJAX calls. On the server the content is distributed to various IE modules. The modules generate annotation suggestions. Each suggestion comprises a textual statement or question, a numeric confidence level, one or more possible answers to the question, and semantic annotations (one per positive answer). The confidence level is a real number of the unit interval, which specifies the validity of the suggestion. Under a certain (configurable) level, suggestions are automatically disregarded. The modules' suggestions are collected on the server and are sent back to the client, where they are presented to the user in the form of pop-up closed-ended questions. If the user confirms a statement, the system inserts the corresponding annotation into the content. When the user finishes editing, the full annotated document is sent back to the server and saved there. Figure 1 provides an overview of the system.

[Figure 1: The overview of the components and the data flow of the Docuphet system.]

In the next section we discuss the components of Docuphet: the content editor, the annotation storage, and the text processing and information extraction applets.

3. TECHNOLOGY OVERVIEW
3.1 The content editor
There is an abundance of tools for creating content on a computer. These can be categorized in many ways. One aspect is the mode of editing. There are WYSiWYG editors like desktop word processors. Other, structured editors have two views: one for editing and one for viewing the document. In structured editors, the user handles objects like sections, titles, and paragraphs. This category includes wiki editors, publishing tools, DocBook editors, and scientific editors for LaTeX. In these applications the final formatting is done with style class files or style sheets.

Another aspect is the technology of the editor. There are two main categories: the more function-rich desktop applications that need to be installed on the client, and the web-based editors [11, 34] that require only a web browser to run. Web-based editors were formerly simpler, but as AJAX became widespread, the complexity of such applications became almost equal to that of desktop ones.

The special requirements on DCE led to the development of a completely new solution. The DCE is an easy-to-use WYSiWYN2 editor, which is capable of editing the structure of a document without the need to learn a markup language. DCE also handles the communication with the server, the representation of suggestions, and the integration of annotations (see also Figures 4–5 for screenshots).

1 In fact, it is one of the central problems of text processing to provide language-independent information extraction methods.
2 "What You See is What You Need" editors let the user edit the structure of the document but not the source markup directly. They differ from WYSiWYG editors in that further transformations are applied to generate the final view of the document.

3.2 Content storage
For selecting the best of the plethora of content storage formats we set up the following requirements:

1. simplicity, to allow the implementation as a web editor;
2. standard, stable, and free to use;
3. proper support (documentation, examples, templates, editors, tools);
4. extensibility to carry annotations;
5. support for the following formatting: paragraphs, sections and titles; lists, program listings, emphases, images, tables, links.

Many options were evaluated: RTF, texinfo [13], troff [7], wikitext [42], XHTML, DocBook, DITA [25], LaTeX, ODF, and CDF [38].

Because of the large variety of convenient XML processing tools we dropped the non-XML formats. We also dropped ODF because of its high complexity. DITA and CDF (WICD) are too specific for our purposes. From the remaining two candidates we decided
in favour of DocBook, because this format is purely structural: it doesn't contain any markup relating to document formatting, and its grammar is defined in an easy-to-subtype Relax NG [26] format.

3.3 Storing semantic annotations in the content
Many possible technologies were evaluated for storing semantic annotations in a DocBook document: HTML metadata, RDF/XML [39], GRDDL [14], Microformats, and RDFa [40]. Finally we selected RDFa because of its many advantages.

RDFa [40] has been developed by the W3C and is by now a W3C recommendation. RDFa offers a technique that transforms an arbitrary part of an XML document into an RDF triple. The technology is primarily aimed at annotating XHTML documents, but it is also capable of handling XML documents from other namespaces (an example is depicted in Figure 2).

[Figure 2: Excerpt of an RDFa-annotated DocBook document]

RDFa was also favoured because it is easy to integrate with other XML namespaces, like DocBook. RDFa allows annotating every part of a document, while it is still relatively easy to retrieve the RDF triples from the XML. To conform DocBook with RDFa, we extended the DocBook RNG schema to carry RDFa annotations, and we termed this DocBook profile DocBook/RDFa.

3.4 Extracting semantic information from the text
3.4.1 Named Entity Recognition
A named entity (NE) is a natural textual identifier of an object, such as person names, names of companies, locations, names of products, addresses, telephone numbers, and email addresses. Named entity recognition (NER) is an NLP task which aims at identifying and classifying NEs in the text.

The difficulty of the recognition task depends on the type of NEs. Telephone numbers and e-mail addresses can be easily recognized with simple regular expressions. The recognition of personal names and locations is more difficult, but it can be effectively supported with appropriate vocabularies. The recognition of company or product names can be much harder, because essentially no constraint applies to their surface form. NER is primarily performed by analysing some features of the candidate NEs, e.g. surface clues (capitalization, numbers, special symbols), the frequency, and the in-sentence, in-paragraph, and in-document positions, whereas grammatical and morphological analysis may also be applied.

The Docuphet framework contains a general purpose named entity recognizer, JNER, implemented in Java. This component comprises several modules, each analysing the text with a different technique. Many of them use vocabularies, e.g. for given names, company suffixes, or locations. Others are based on regular expressions. It is possible to plug in external tools, such as a stemmer or a morphological parser. Other external tools can serve as connectors to databases (e.g. IMDb, DMOZ) or to search engines (Google, Wikia Search).

Although usually not considered as NEs, the recognition of professions (like painter, composer, engineer) and human properties (blond, tall) is also supported in JNER.

Conforming with the philosophy of Docuphet (ask relevant questions from the user), the NER can also be assisted by asking closed-ended questions from the user.

3.4.2 Information Frame Recognition
An information frame (IF) is a triple in which at least one value is missing and is thus substituted by a variable name. The class of the missing component(s) may be known.

Information frame recognition (IFR) means the recognition of instances of an IF in the text by identifying the missing components of the triple. For instance, in the sentence "Pat Nixon was born in Nevada in 1912", we can recognize an instance of such an IF.

IFs can be defined in various ways. One method is to derive them from the "semantic frames" used in the Berkeley FrameNet project [1] (see the example below). The aim of this project is to create an annotated lexical resource for English, using frame semantics and supported by corpus evidence. These frames refer to pre-defined conceptual structures:

    frame(DESIGN),
      inherit(CREATE),
      frame_elements(DESIGNER(=CREATOR), BUILDING(=WORK)),
      scenes(DESIGNER designs BUILDING)

The frame elements refer to certain semantic roles of actors and objects present in the scene. In the FrameNet project, human annotators mark the occurrences of the frame elements in texts.

Evidently, to find the possible semantic role of a text element, it is very helpful to have the named entities and their types identified beforehand. In some cases additional rules may also be useful, e.g. specifying constraints on the relative position of the elements of an IF (in-sentence, in-paragraph). While limiting the number of potentially identifiable IFs, this constraint has many practical advantages: it enables starting the IFR process before the whole content is available, and it significantly narrows the search space, thus reducing the computation time.

JFrame is the IFR component of the Docuphet framework, written in Java. IFs and the corresponding recognition rules can be defined in JFrame. The input of the module is a token stream in which the recognized NEs and their types are already marked. A JFrame IF definition may contain rules related to the class and lemmas of the NEs in the token stream, and may have conditions on their order. JFrame also provides a confidence level for every recognized IF instance, in accordance with Docuphet's workflow.

Currently in Docuphet, the IFR rules are defined by hand after experimentation.
This is necessary because:

• There isn't enough properly annotated text which can be used as training data.
• The annotation method in Docuphet is based on dialogues. Therefore, for each IF a simple function must be defined which generates the corresponding closed-ended questions.

3.4.3 Sentence segmentation
To support the recognition of IFs, a sentence segmenter named JSentence was developed. The component implements a modified version of the algorithm described in [33]. The Hungarian configuration of the tool was tested on the Szeged2 corpus [31], the biggest multi-thematic text corpus in Hungarian. The corpus contains 82096 sentences and consists of complete novels from various authors and genres, high school essays, general newspaper articles, legal texts, computer-related handbooks, and economic news. JSentence recognized sentence boundaries with a precision of 99.06% at FP ≈ FN. JSentence uses a rule-based algorithm with 15 regular-expression-based rules and 4 abbreviation lists.

[Figure 3: A typical workflow of Docuphet]

4. APPLICATION EXAMPLES
In this section two demo applications of Docuphet are presented. BioBase is a web site for collecting biographies of known people (similar to the ones in [18]), user autobiographies, or simple self-introductory texts. FlatBase is a real estate advertisement portal. In the case of the biographies, Docuphet is configured to recognize IFs based only on the entered text. For the FlatBase portal, the entered text is analysed first, and when some relevant information is still missing (e.g. the floor number) Docuphet automatically asks questions from the user to complete the advertisement database properly. Both applications are configured to work on Hungarian text.

The workflow is the same in both scenarios, as described in Section 2.1. Both applications use DCE and the same document server component versions. They differ in the way the annotations are produced, thus we discuss this part in detail next.

4.1 BioBase
In BioBase a two-layered IF recognition has been implemented. In the first layer NEs are identified as follows:

1. JNER analyses the text, configured with all the NER rules available for the Hungarian language. Currently this includes person name recognition (male and female distinguished) based on regular expressions and a given-name vocabulary; location, nationality, profession, education, residence, family status, and social relation (friends/other related people) recognition based on vocabularies; email and phone number recognition based on regular expressions; and date recognition based on a custom Java component, regular expressions, and a list of month names.

2. JNER assigns a confidence level to every recognized NE, which is calculated by summing the pre-defined values of the matching rules.3 Over confidence level 0.95, an annotation suggestion is created with the following RDF triple:

    … <[NE value]> .

where the NE type and NE value are substituted with the actual values. The annotation suggestions also have a level of confidence4 that is set to 0.95. DCE accepts every annotation above 0.9 without asking a confirming question from the user, in order to avoid flooding the user with questions.5 The target of the annotations (the location where the annotation is placed) is the node before the given paragraph. From these annotations, a typed tag cloud can be generated (see also Figure 4).

3. For the top three person and location NEs with a confidence level between 0.8 and 0.95, an annotation suggestion of value 0.8 is created with the question "This article is related to the person/location (value)", with the same target and RDF triple as in the previous step.

3 The creator of a rule can set both positive and negative confidence coefficients.
4 This is based on the NE confidence level, but can be weighted, e.g. when a candidate person NE shares the surname with an already recognized NE.
5 Every annotation can be easily removed by deletion or with the undo action.

Based on the NEs recognized, the following IE attempts are made:

1. First, the person whom the article is about is identified. This is done by creating suggestions on the first person NE found (in the title or in the text) with the triple

    … <[NE value]> .

where the target is the beginning of the article and the confidence level is 0.8. This is repeated with the subsequent person NEs until a positive answer is given by the user. (In our experience the article is nearly always about the first mentioned person.)

2. If the subject of the article is known, the date and place of birth and death are attempted to be identified next. This is done by analysing the date and location tokens within a centered window of 15 tokens around the occurrences of the identified person. Extra confidence is added if a past tense verb form, "született" (was born) or "elhunyt | meghalt" (died), appears around a particular date or location (these are, however, often omitted).

3. Nationality, profession, education, residence, friends, and the parents' names are identified in a similar way as described in the previous point, involving certain terms (his mother | father) where appropriate.

Using these annotations the persons can be categorized by the era in which they lived or live, by profession, nationality, or location. All the RDF properties used in the process are in BioBase's namespace, but if required, they can easily be mapped into other namespaces, like FOAF [12].

Our experience with BioBase shows that Docuphet is able to capture the basic biographic information about a person. When creating a 500-word-long biography (see Figure 4), the system pops up 10–15, mostly adequate, questions. One direction of further work could be the addition of new event IFs based on a review of biographies.

[Figure 4: A screenshot from BioBase]

4.2 FlatBase
IFR is performed in FlatBase also in two layers, but in a completely different way than in BioBase. On the first level, the main information of the advertisement is attempted to be identified:

• type of property: house, condominium, apartment, etc.;
• type of offer: for rent, for sale;
• location: countryside, city;
• price range;
• size;
• contact information.

When a part of the main information becomes available, the details are attempted to be extracted. Some examples:

• details about the location (district or quarter, street name);
• building materials;
• in the case of flats: on which particular floor the property is; whether it is the top/ground floor;
• for certain types of property: orientation (street, garden);
• for a house: size of the garden;
• type of heating (depends on the property type);
• arrangement of rooms (e.g. separate entrance);
• public transport facilities (different types, based on the city).

On both levels, specially configured JNER instances are used. The configuration initially includes a shorter list of locations and regular-expression-based rules to recognize price, size, and contact information. IFR is then performed based on the already known NEs and certain trigger words related to building materials, heating types, etc. When some information is extracted, the JNERs may be reconfigured accordingly (e.g. loading the corresponding quarters when a district of Budapest is found).

When some of the main information is missing from the ad at saving time, the user is asked to complete it.6 If this happens again, a limited number of direct questions are formulated related to the missing information, every question only once. At the end, the ad is saved even if some information is still missing.

6 But the actual content is stored anyway.

FlatBase has proven to be very effective because flat advertisements are usually short, very similar to each other, and their vocabulary is limited. As a result, FlatBase is capable of extracting nearly all (usually not more than 5–10, see Figure 5) important pieces of information from an advertisement. This suggests that FlatBase may become a user-friendly input-assisting tool for advertisement portals.

[Figure 5: A screenshot from FlatBase]

5. DISCUSSION
Bodain and Robert composed five requirements on the static properties of semantic annotations [3]:

• robust anchors,
• transparency,
• freedom in choosing the semantic vocabulary,
• variable granularity,
• handling dynamic updates.

For dialogue-assisted annotators, such as Docuphet, most of the above requirements can be carried over7, but as an outcome of our experimentation, we can now formulate additional requirements on semantic annotation creation:

1. A particular question must never be asked twice, to avoid the discontent of the user. This implies that:
2. Every suggestion must be stored in the document, regardless of the answer, if any. Per-session storage is not sufficient since the document can be edited in several sessions. Further research is needed to find out whether suggestions should be stored on a per-document or a per-user basis.
3. There are two main types of suggestions depending on the positive/negative answer of the user. As future work it should be investigated how negative answers can be exploited in annotation.
4. According to the "handling dynamic updates" rule described in [3], the annotations must be re-validated upon every text change. Given our point 1, this requirement can hardly be met. Theoretically it is possible to insert a statement into the first part of a document which negates the meaning of everything in a given scope. To handle this appropriately, we should either fully capture the meaning of the change and update the right annotations, or re-ask every question. The first solution is not yet possible to carry out; the latter causes a high number of undesirable questions. To get around this problem, we devised two techniques:

   • Limited scope: We have defined two types of annotations: basic and derived. Basic annotations regard only a specific text scope (a title, a paragraph, a list item). We assume that changes outside the scope do not affect them. To fulfil this assumption, basic annotations are very simple, e.g. "This paragraph is related to Budapest" when the token Budapest is present. These annotations are usually retrieved by NER as described in Section 4.1, and re-validated upon text change in their scope.

   • Dependencies: We define a dependency graph of annotations. Derived annotations depend on basic or other derived annotations. The dependency is tracked with a list of annotation IDs. Every time an annotation changes, its dependants should be re-validated. If an annotation is deleted, the derived annotations must be deleted as well. Furthermore, a derived annotation may require re-validation when non-annotated parts of the text change, since some derived annotations may depend on the characteristics of the text or on missing annotations. For example, if the user accepts a new "main category" annotation, the old one must be re-validated or simply deleted.

5. The number of questions asked together has to be limited. This is important because if the user pastes in a larger piece of text, then many suggestions may be generated. These must be asked in several turns.

6. RELATED WORK
In the last 15 years, a large number of semantic annotation tools has been developed. Here we recall and compare the most important ones (see also Table 1).

We can divide the semantic annotators into two main groups:

• Semantic wikis: One form of inserting semantic annotations into documents is via semantic wikis, such as Semantic MediaWiki [23], Artificial Memory [21], Kaukolu [8], PHPWiki [36], IkeWiki [30], and SWiM [19]. These applications enable the user to input RDF data by using a special syntax. The available semantic vocabulary and the granularity of the annotations vary in these applications, but in all cases semantic handling skill is required from the user.

• Desktop ontology builders and annotators: This group contains some feature-rich desktop annotators for authoring semantically annotated documents. Protégé [28], TopBraid [35], Amaya [29], and Mangrove [22] are frameworks for building ontologies and knowledge graphs. SWEDT [27], Apolda [41], and KATIA [3] have rich document editing and annotating capabilities. These are professional tools for knowledge experts.

S-CREAM [15] integrates the Amilcare [4] IE module that implements a semi-supervised machine learning method: a set of training data must be annotated by hand in advance, then Amilcare creates certain annotations automatically in new documents. COHSE [2] highlights text and provides additional information for strings matching elements of a pre-defined knowledge base. Magpie [10] allows annotation of a pre-defined set of concepts based on forms.

Docuphet has much in common with SWEDT [27] and KATIA [3] in the visualization of annotations. But unlike these tools, Docuphet's editor hides the annotation markup details from the user and provides annotation visualization instead.

Like Melita [32], AKTiveDoc [20], MnM [37], or S-CREAM [15], Docuphet also uses IE technology to extract semantic annotation candidates from the text. However, the way Docuphet uses IE is quite dissimilar from these tools, since it uses IFs as a common concept of semantic information and involves the user in the process.

Like COHSE and Magpie, Docuphet is based on pre-defined concepts and relations, which are termed information frames. The targeted audience of unskilled users and the NLP- and IE-based dialogue-assisted annotation creation render our solution rather unique. Our approach is comparable to the question-sequence-based guidance provided by some complex installation wizards, where the questions are based on the information gathered earlier. In Docuphet the set of IFR elements represents the knowledge base, which provides a sophisticated, flexible, and order-independent solution.

7. CONCLUSION AND FUTURE WORK
Docuphet is a dialogue-assisted semantic text annotator. The computer–human dialogue is facilitated by IE techniques: named entity recognition and information frame extraction.
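As a recap of the mechanism described in Section 3.4.2, the IF-instantiation step can be sketched like this (hypothetical names and rule, not JFrame's actual API): the token stream arrives with the NE types already marked, a trigger word selects the IF, and the filled triple is paired with a closed-ended question and a confidence level.

```python
# Token stream with NE types already marked by the NER stage.
tokens = [("Pat Nixon", "PERSON"), ("Nevadában", "LOCATION"),
          ("született", "VERB")]  # "született" = "was born" (Hungarian)

def recognize_birth_if(tokens):
    """Hypothetical IF rule: ?person bio:birthPlace ?place,
    triggered by the verb 'született'."""
    if not any(word == "született" for word, _ in tokens):
        return None  # trigger word absent, IF not instantiated
    person = next((w for w, t in tokens if t == "PERSON"), None)
    place = next((w for w, t in tokens if t == "LOCATION"), None)
    if person is None or place is None:
        return None  # a component of the triple is still missing
    return {
        "triple": (person, "bio:birthPlace", place),
        "question": f"Was {person} born in {place}?",  # closed-ended
        "confidence": 0.8,
    }

print(recognize_birth_if(tokens)["triple"])
# -> ('Pat Nixon', 'bio:birthPlace', 'Nevadában')
```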
Table 1: Comparing Docuphet to other solutions

| Name | Platform | Editor type | Storage format | Semantic annotation input | Semantic vocabulary | Way of storage | Skills required |
| WikiPedia | Web | textarea | wikitext | – | – | – | – |
| Semantic MediaWiki | Web | textarea | wikitext | markup | arbitrary | special wikitext | ontologies, markup |
| SWEDT | Eclipse | source code editor | HTML | form + RDF source | arbitrary | RDF | Web development, ontologies |
| Katia | desktop Java | word processor | HTML | drag-and-drop | predefined | RDF server | ontologies |
| Amaya | desktop application | source code editor | HTML | forms | arbitrary | RDF server | Web development, ontologies, RDF |
| Melita | desktop application | word processor | HTML | forms + named entity suggestions (a) | predefined | RDF database | ontologies |
| Docuphet | Web | What You See is What You Need | Docbook | suggestions in natural language | predefined | RDFa | – |

(a) Melita collects all named entities as potential instances of ontology classes and builds the corresponding ontology simultaneously.

In Docuphet, unlike in some semantic wikis, it is not possible to annotate the text with an arbitrary vocabulary: because of the nature of the system, we applied a pre-defined semantic vocabulary. Therefore Docuphet is only capable of handling pre-defined RDF information triples, which limits the flexibility of the system. On the other hand, this very property allows us to compose easy-to-understand questions about the known triples, as the questions are defined together with the corresponding IFs. This way it is easy to create annotations even for completely uninitiated users.

Given these properties, Docuphet is most useful when the domain of the text is known in advance. Two exemplary applications were presented in Section 4. Other possible applications include the annotation of economic or sport news, product reviews or geolocation reviews. In these cases the set of appropriate IFs and the corresponding IFR rules have to be created in advance.

Despite these limitations, basic functionality is available without specific domain knowledge. Docuphet is capable of recognizing NEs in an arbitrary text and formulating questions about the NE candidates. This makes it a very useful tool for building NE databases and for disambiguation applications. Another possible application area is the assistance of context-sensitive browser tools, such as In4's iGlue [17] and Context Discovery Inc.'s Context Organizer [6].

As for future work, we intend to enable Docuphet to access and edit Wikipedia articles via the interface provided by MediaWiki's public API. As Wikipedia uses the wikitext format, which is very different from Docbook/RDFa, the most problematic tasks are apparently the conversion of the articles and the placement of the annotations in wikitext. We also plan to integrate Docuphet with large public databases like IMDB, to facilitate disambiguation and named entity recognition.

We think that in machine understanding, bidirectional communication (questions and answers) is a key element, just like in human understanding. However, we admit that if the questions are not relevant enough, this proactive behavior probably causes discontent on the user's part. To find out more about users' reactions when using our system, we intend to conduct experiments and surveys with many users.

The relevance of the questions can be improved if topical category labels are available for the documents. Therefore we plan to prepare Docuphet to collaborate with document classifiers, such as the hitec3 framework [16].

Acknowledgement
Domonkos Tikk was supported by the Alexander von Humboldt Foundation.

8. REFERENCES
[1] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics, pages 86–90, Morristown, NJ, USA, 1998. Association for Computational Linguistics.
[2] S. Bechhofer, C. Goble, L. Carr, and S. Kampa. COHSE: Semantic web gives a better deal for the whole web? ISWC International Semantic Web Conference Poster, 2002.
[3] Y. Bodain and J.-M. Robert. Developing a robust authoring annotation system for the semantic web. Proc. of 7th IEEE Int. Conf. on Advanced Learning Technologies, 2007.
[4] F. Ciravegna, A. Dingli, Y. Wilks, and D. Petrelli. Amilcare: adaptive information extraction for document annotation. SIGIR'02: Proc. of the 25th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 367–368, 2002.
[5] Citizendium. en.wikipedia.org/wiki/Citizendium.
[6] Context Discovery Inc. Context Organizer for the Web. http://www.contextdiscovery.com/context-organizer-for-the-web.aspx.
[7] R. Corderoy. troff. www.troff.org/.
[8] DFKI Knowledge Management. Kaukolu. www.dfki.de/web/forschung/km/.
[9] The Docuphet project. www.docuphet.net.
[10] J. Domingue, M. Dzbor, and E. Motta. Semantic layering with Magpie. In Handbook on Ontologies, pages 533–554. Springer, 2004.
[11] FCKeditor. www.fckeditor.net/.
[12] The Friend Of A Friend project. www.foaf-project.org/.
[13] Free Software Foundation. Texinfo. www.gnu.org/software/texinfo/.
[14] GRDDL Working Group. Gleaning resource descriptions from dialects of languages. www.w3.org/TR/grddl/.
[15] S. Handschuh, S. Staab, and F. Ciravegna. S-CREAM: semi-automatic creation of metadata. Proc. of the European Conf. on Knowledge Acquisition and Management, 2002.
[16] HITEC. categorizer.tmit.bme.hu/trac/wiki.
[17] In4 Ltd. The iGlue project. http://iglue.com/beta/.
[18] Á. Kenyeres. Magyar Életrajzi Lexikon (Hungarian Biography Encyclopedia). Arcanum Adatbázis Kft, 1994.
[19] KWARC. SWiM: A semantic wiki for mathematical knowledge management. kwarc.info/projects/swim/.
[20] Vitaveska Lanfranchi, Fabio Ciravegna, Phil Moore, and Daniela Petrelli. Document editing and browsing in AKTiveDoc.
In DocEng '05: Proceedings of the 2005 ACM Symposium on Document Engineering, pages 237–238, New York, NY, USA, 2005. ACM.
[21] L. Ludwig. Artificial memory. www.artificialmemory.net/.
[22] L. McDowell, O. Etzioni, S. D. Gribble, A. Y. Halevy, H. M. Levy, W. Pentney, D. Verma, and S. Vlasseva. Mangrove: Enticing ordinary people onto the semantic web via instant gratification. Proc. of the International Semantic Web Conference, pages 754–770, 2003.
[23] Semantic MediaWiki. semantic-mediawiki.org/wiki/Semantic_MediaWiki.
[24] Nupedia. en.wikipedia.org/wiki/Nupedia.
[25] OASIS DITA Technical Committee. Darwin Information Typing Architecture. www.oasis-open.org/committees/dita.
[26] OASIS RELAX NG Committee. RELAX NG. www.oasis-open.org/committees/relax-ng.
[27] R. G. Pereira and M. M. Freire. SWedt: A semantic web editor integrating ontologies and semantic annotations with Resource Description Framework. IEEE Int. Conf. on Internet and Web Applications and Services, pages 200–200, 2006.
[28] Protégé. protege.stanford.edu/.
[29] V. Quint and I. Vatton. An introduction to Amaya. World Wide Web J., 1997.
[30] Salzburg Research. IkeWiki. ikewiki.salzburgresearch.at/.
[31] Szegedi Tudományegyetem Nyelvtechnológiai Csoport (University of Szeged, Language Technology Group). Szeged korpusz 2 (Szeged Corpus 2). www.inf.u-szeged.hu/projectdirs/hlt/.
[32] Advanced Knowledge Technologies. Melita. http://www.aktors.org/technologies/melita/.
[33] D. Tikk. Szövegbányászat (Text Mining), chapter 2. TypoTEX, 2007.
[34] Tiny Moxiecode Content Editor (TinyMCE). tinymce.moxiecode.com/.
[35] TopQuadrant. TopBraid. www.topquadrant.com/topbraid/composer/index.html.
[36] VA Linux Systems. PhpWiki. phpwiki.sourceforge.net/.
[37] Maria Vargas-Vera, Enrico Motta, John Domingue, Mattia Lanzoni, Arthur Stutt, and Fabio Ciravegna. MnM: Ontology driven semi-automatic and automatic support for semantic markup. In EKAW '02: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web, pages 379–391, London, UK, 2002. Springer-Verlag.
[38] W3C CDF Working Group. Compound Document Format. www.w3.org/2004/CDF/.
[39] W3C Semantic Web Activity. RDF/XML Syntax Specification. www.w3.org/TR/rdf-syntax-grammar/.
[40] W3C Semantic Web Activity. RDFa. www.w3.org/TR/xhtml-rdfa-primer/.
[41] C. Wartena, R. Brussee, L. Gazendam, and W.-O. Huijsen. A practical tool for semantic annotation. IEEE 18th Int. Conf. on Database and Expert Systems Applications (DEXA), 2007.
[42] wikitext. en.wikipedia.org/wiki/Wikitext.