MnM: Ontology-Driven Tool for Semantic Markup


Maria Vargas-Vera1 and Enrico Motta1 and John Domingue1 and Mattia Lanzoni1 and Arthur Stutt1 and Fabio Ciravegna2

Abstract. An important precondition for realising the goal of a semantic web is the ability to annotate web resources with semantic information. In order to carry out this task, users need appropriate representation languages, ontologies, and support tools. In this paper we present MnM, an annotation tool which provides both automated and semi-automated support for annotating web pages with semantic contents. MnM integrates a web browser with an ontology editor and provides open APIs to link to ontology servers and for integrating information extraction tools. MnM can be seen as an early example of the next generation of ontology editors, being web-based, oriented to semantic markup and providing mechanisms for large-scale automatic markup of web pages.

1    INTRODUCTION

An important pre-condition for realising the goal of the semantic web is the ability to annotate web resources with semantic information. In order to carry out this task, users need appropriate knowledge representation languages, ontologies, and support tools. The knowledge representation language provides the semantic interlingua for expressing knowledge precisely. RDF ([11] and [16]) and RDFS [2] provide the basic framework for expressing metadata on the web, while current developments in web-based knowledge representation, such as DAML+OIL (DAML+OIL, 2001) and WebOnt (http://www.w3.org), are building on the RDF base framework to provide more sophisticated knowledge representation support. Ontologies ([9]) provide the mechanism to support interoperability at a conceptual level. In a nutshell, the idea of interoperating agents able to exchange information and carry out complex problem solving on the web is based on the assumption that these agents will share common, explicitly defined, generic conceptualizations. These are typically models of a particular area, such as product catalogues, or taxonomies of medical conditions, although ontologies can also be used to support the specification of reasoning services ([18], [20] and [8]), thus allowing not only ‘static’ interoperability through shared domain conceptualizations, but also ‘dynamic’ interoperability through the explicit publication of competence specifications, which can be reasoned about to determine whether a particular semantic web service is appropriate for a particular task.

Ontologies and representation languages provide the basic semantic tools to construct the semantic web. Obviously a lot more is needed; in particular, tool support is needed to facilitate the development of semantic resources, given a particular ontology and representation language. This problem is not a new one: knowledge engineers early on realised that one of the main obstacles to the development of intelligent, knowledge-based systems was the so-called knowledge acquisition bottleneck ([7]). In a nutshell, the problem is how to acquire and represent knowledge, so that this knowledge can be effectively used by a reasoning system. Although the problem is not a new one, the context provided by the semantic web introduces new aspects to the problem, with respect to the nature of the knowledge and the type of users.

Nature of the knowledge. Traditional knowledge acquisition was concerned with knowledge for problem solving. Semantic markup will primarily focus on ontology population, a far easier knowledge acquisition task.

Type of users. Knowledge-based systems are normally written by skilled knowledge engineers. On the web, it is likely that semantic marking up will become a common activity, carried out by content providers who are not necessarily skilled knowledge engineers. This means that


1 The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK. Email: {m.vargas-vera; e.motta; j.b.domingue; m.lanzoni; a.stutt}@open.ac.uk

2 Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK. Email: f.ciravegna@dcs.shef.ac.uk

more emphasis will have to be put on facilitating semantic markup by ‘ordinary’ web users (people who are neither experts in language technologies nor 'power knowledge engineers'). In particular, automated knowledge extraction technologies are likely to play an increasingly important role, as a crucial technology to tackle the semantic web version of the knowledge acquisition bottleneck.

In this paper we present MnM, an annotation tool which provides both automated and semi-automated support for marking up web pages with semantic contents. MnM integrates a web browser with an ontology editor and provides open APIs to link to ontology servers and for integrating information extraction tools. MnM can be seen as an early example of the next generation of ontology editors, being web-based, oriented to semantic markup and providing mechanisms for large-scale automatic markup of web pages.

The rest of the paper is organised as follows: in the next section we will show the process model underlying the design of the tool. Finally, sections 3 and 4 discuss related work and re-state the main tenets and results from our research.

2    PROCESS MODEL

Within this work we have focused on creating a generic process model for developing semantically enriched web content. The component tools which are used in MnM are ontology servers, Information Extraction (IE) tools and augmented web browsers. During our initial work in this area we found that either the existing tools did not directly support the creation of semantic web content or the mapping between the tasks to be carried out and the toolset was non-trivial. Hence, within MnM, we adopted a generic process model, which can be easily understood by web developers who are not necessarily expert ontology engineers or human language technology experts.

Another key feature of our process model is that it is generic with respect to the specific ontology server and IE technologies used.

There are five main activities supported by MnM:

    •    Browse. A specific set of knowledge components is chosen from a library of knowledge models on an ontology server.
    •    Markup. The chosen set of knowledge components is selected to form the basis of an IE mechanism. A corpus of documents is manually marked up.
    •    Learn. A learning algorithm is run over the marked up corpus to learn the extraction rules.
    •    Test. The IE mechanism is run over a test corpus to assess its precision and recall measures.
    •    Extract. An IE mechanism is selected and run over a set of documents.

We will now provide more details of each of the above activities in turn.

Browse
In this activity the user browses a library of knowledge models which sit on a web based ontology server. The user can see an overview of the existing models and can select which one to focus on (i.e., which ontology to use to initiate the markup process). Within a selected ontology the user can browse the existing items, for example the classes and instances. Items within an ontology can be selected as the starting point for selecting an IE mechanism. More specifically, the selected class forms the basis for a template which will eventually be matched against a corpus of documents and instantiated in the extraction activity.

Mark-Up
Once a class has been selected a training corpus of manually marked up pages needs to be created. Here the user views appropriate documents within MnM’s built-in web browser and annotates segments of text using the tags based on the class’s slots as given in the ontology (i.e., ontology driven mark-up). As the text is selected MnM inserts the relevant XML tags into the document.

Learning
MnM integrates web browsing, ontology browsing and IE development. It does not have a built-in IE tool but provides a plug-in interface which allows the integration of IE tools easily.

In a previous version of MnM we integrated Marmot, Badger and Crystal from the University of Massachusetts ([21]) and our own NLP components (i.e., OCML preprocessor). A full description of this version can be found in ([23] and [24]). However, in this paper we will concentrate on the recent integration work that we have carried out with Amilcare, a tool for adaptive information extraction ([3]).

Amilcare is designed to support active annotation of documents. It performs IE by enriching texts with XML annotations. To use Amilcare in a new domain the user simply has to manually annotate a training set of documents. No knowledge of Natural Language Technologies is necessary.

Amilcare is designed to accommodate the needs of different user types. While naïve users can build new applications without delving into the complexity of Human Language Technology, IE experts are provided with a number of facilities for tuning the final application. Induced rules can be inspected, monitored and edited to obtain some additional accuracy, if required. The interface also allows precision (P) and recall (R) to be balanced. The system can be run on an annotated unseen corpus and users are presented with statistics on accuracy, together
with details on correct matches and mistakes. Retuning the P&R balance does not generally require major retraining; facilities for inspecting the effect of different P&R balances are provided. Although the current interface for balancing P&R is designed for IE experts, a future version will provide support for naive users ([6]).

At the start of the learning phase Amilcare preprocesses texts using Annie, the shallow IE system included in the Gate package ([17], www.gate.ac.uk). Annie performs text tokenization (segmenting texts into words), sentence splitting (identifying sentences), part of speech tagging (lexical disambiguation), gazetteer lookup (dictionary lookup) and named entity recognition (recognition of people and organization names, dates, etc.).

Amilcare then induces rules for information extraction. The learning system is based on LP2, a covering algorithm for supervised learning of IE rules based on Lazy-NLP ([3] and [4]). This is a wrapper induction methodology ([15]) that, unlike other wrapper induction approaches, uses linguistic information in the rule generalization process. The learning system starts inducing wrapper-like rules that make no use of linguistic information, where rules are sets of conjunctive conditions on adjacent words. Then the linguistic information provided by Annie is used in order to create generalised rules: conditions on words are substituted with conditions on the linguistic information (e.g. condition matching on either the lexical category, or the class provided by the gazetteer, etc.) ([4]).

All the generalizations are tested in parallel by using a variant of the AQ algorithm ([19]) and the best generalizations are kept for IE. The idea is that the linguistic-based generalisation is deployed only when the use of NLP information is reliable or effective. The measure of reliability here is not linguistic correctness, but effectiveness in extracting information using linguistic information as opposed to using shallower approaches. Lazy NLP-based systems learn which is the best strategy for each information/context separately. For example they may decide that using the result of a part of speech tagger is the best strategy for recognising the speaker in seminar announcements, but not to spot the seminar location. This strategy is quite effective for analysing documents with mixed genres, a common situation in web documents ([5]).

The learning system induces two types of rules: tagging rules and correction rules. A tagging rule is composed of a left hand side, containing a pattern of conditions on a connected sequence of words, and a right hand side that is an action inserting an XML tag in the texts. Correction rules shift misplaced annotations (inserted by tagging rules) to the correct position. These are learnt from the errors found whilst attempting to re-annotate the training corpus using the induced tagging rules.

Correction rules are identical to tagging rules, but (1) their patterns also match the tags inserted by the tagging rules and (2) their actions shift misplaced tags rather than adding new ones. The output of the training phase is a collection of rules for IE that are associated with the specific scenario.

Testing
MnM provides various mechanisms for selecting a test corpus and distinguishing it from a training corpus. The user can manually select training and test corpora and these can be in the form of local files or on the web. In addition, it is also possible to simply select a corpus (either locally or on the web) and let the system create test and training corpora randomly.

Extraction
After the training phase Amilcare has a library of induced rules which can be used to extract information from texts. When working in extraction mode, Amilcare receives as input a (collection of) text(s) with the associated scenario, including the rules induced during the training phase (a scenario is the set of tags that the user inserts in the training corpora). It preprocesses the text(s) by using Annie and then it applies its rules and returns the original text with the added annotations. The Gate annotation schema is used for annotation ([17]).

Once that is done the information extracted is presented to the user for approval. Then the extracted information is sent to the ontology server which will populate the selected ontology.

During the population step the IE mechanism fills predefined slots associated with an extraction template. Each template consists of slots of a particular class as defined in the selected ontology; for instance, the class visiting-a-place-or-people has the slots: visitor, place, etc.

Our goal is to automatically fill as many slots as possible. However, some of the slots may still require manual intervention. There are several reasons for this problem:

    •    there is information that is not contained in the text,
    •    none of the rules from our IE libraries match with the sentence that might provide the information (incomplete set of rules). This means that the learning phase needs to be tuned.

The extracted information is also validated using the ontology. This is possible because each slot in each class of the ontology has a type associated with it. Therefore, extracted information which does not match the type definition of the slot in the ontology can be highlighted as incorrect.

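As an illustration only, the type-based validation of extracted slot values could be sketched in Python as follows. The slot types, the instance table and the function name validate_template are hypothetical. The text states only that each slot has an associated type and that the class visiting-a-place-or-people has slots such as visitor and place.

```python
# Hypothetical slot type definitions for the class named in the text.
# The paper does not list the actual types, so these are assumptions.
SLOT_TYPES = {
    "visiting-a-place-or-people": {"visitor": "person", "place": "place"},
}

# Hypothetical knowledge base mapping known instances to their types.
KNOWN_INSTANCES = {
    "J. Smith": "person",
    "The Open University": "place",
}


def validate_template(class_name, extracted):
    """Sort extracted slot fillers into accepted, flagged and missing.

    accepted: fillers whose known type matches the slot's type
    flagged:  fillers with a different or unknown type (shown as incorrect)
    missing:  slots with no filler, left for manual intervention
    """
    slot_types = SLOT_TYPES[class_name]
    accepted, flagged = {}, {}
    for slot, value in extracted.items():
        expected = slot_types.get(slot)
        actual = KNOWN_INSTANCES.get(value)
        if expected is not None and actual == expected:
            accepted[slot] = value
        else:
            flagged[slot] = value
    missing = [slot for slot in slot_types if slot not in extracted]
    return accepted, flagged, missing


if __name__ == "__main__":
    # An organisation extracted as the visitor is flagged, and the
    # unfilled place slot is reported for manual completion.
    print(validate_template(
        "visiting-a-place-or-people",
        {"visitor": "The Open University"},
    ))
```

Keeping flagged and missing slots separate mirrors the two situations described above: a filler that contradicts the slot's type definition, and a slot that the induced rules could not fill at all.
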
3    RELATED WORK

A number of annotation tools for producing semantic markup exist. The most interesting of these are Annotea ([13]); SHOE Knowledge Annotator ([12]); the COHSE annotator ([1]); AeroDAML ([14]); and OntoMat, a tool being developed using the CREAM annotation framework ([10]). A commercial version of OntoMat is available as OntoAnnotate (http://www.ontoprise.de/com/co_produ_tool2.htm).

Annotea provides RDF-based markup but it does not support information extraction nor is it linked to an ontology server. It does, however, have an annotation server which makes annotations publicly available. SHOE Knowledge Annotator allows users to mark up pages in SHOE guided by ontologies available locally or via a URL. These marked up pages can be reasoned about by SHOE-aware tools such as SHOE Search. The COHSE annotator uses an ontology server to mark up pages in DAML+OIL. The results can be saved as RDF. AeroDAML is available as a web page. The user simply enters a URL and the system automatically returns DAML annotations on a web page using a predefined ontology based on WordNet.

Of the systems listed above, OntoMat is closest to MnM both in spirit and in functionality. Both can provide some form of automated extraction. However, while MnM makes it possible to access ontology servers through APIs, such as OKBC, and also to access ontologies specified in a markup format, such as RDF and DAML+OIL, OntoMat only provides the latter functionality. In contrast with OntoMat, MnM can handle multiple ontologies at the same time, which makes it very easy to switch from one to another, and also allows inherited definitions to be displayed for ontology editing and browsing. On the other hand, OntoMat can store pages annotated in DAML+OIL using OntoBroker as an annotation server. It also provides crawlers which can search the Web for marked up pages for addition to its internal knowledge base.

While MnM and OntoMat are very similar, they illustrate a slight difference of emphasis in providing tools for the Semantic Web. While OntoMat adopts the philosophy that the markup which indicates the knowledge content of a web resource should be included as part of that resource, MnM’s annotations are stored both as markup on a page and as items in a knowledge base held on the WebOnto combined ontology and knowledge base server.

4    CONCLUSIONS

In this paper we have described MnM, an ontology-based annotation tool which provides both automated and semi-automated support for annotating web pages with semantic contents. The first prototype of the system has now been completed and tested with both Amilcare and the UMass set of tools. The early results are encouraging in terms of the quality and robustness of our current implementation; however, there is clearly a lot more work needed to make this technology easy to use for our target user base (people who are neither experts in language technologies nor 'power knowledge engineers'). In particular, all the activities associated with automated markup tend to be very sensitive to the quality of markup and to the appropriateness of the chosen corpora. Amilcare already attempts to address some of these issues through its adaptive mechanisms; however, more work is needed in this area. In addition, we also plan to do more work on the user interface, in particular with respect to the integration of markup, ontology browsing and the 'semantic navigation' of web pages. Currently, ontology and web browsing are integrated with respect to contents annotation, but ontologies do not inform the web browsing component of MnM directly. Our vision for the semantic web is one in which new forms of 'conceptual navigation' will emerge, where association between resources will be semantic as well as hypertextual. We plan to experiment with these ideas and extend the interface of MnM to support novel, markup-driven forms of web browsing, as well as the standard HTML based ones.

ACKNOWLEDGEMENTS

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), which is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. The AKT IRC comprises the Universities of Aberdeen, Edinburgh, Sheffield, Southampton and the Open University. The authors would like to thank Maruf Hassan and Simon Buckingham Shum for their invaluable help in reviewing the first draft of this paper.

REFERENCES

[1] S. Bechhofer and C. Goble, Towards Annotation Using DAML+OIL. First International Conference on Knowledge Capture (K-CAP 2001). Workshop on Semantic Markup and Annotation. Victoria, B.C., Canada, October 2001.
[2] D. Brickley and R. Guha, Resource Description Framework (RDF) Schema Specification 1.0. Candidate recommendation,
   World Wide Web Consortium, 2000. URL: http://www.w3.org/TR/2000/CR-rdf-schema-20000327.
[3] F. Ciravegna, Adaptive Information Extraction from Text by Rule Induction and Generalisation, Proc. of 17th International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle, August 2001.
[4] F. Ciravegna, LP2 an Adaptive Algorithm for Information Extraction from Web-related Texts. Proc. of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining held in conjunction with the 17th International Conference on Artificial Intelligence (IJCAI-01), August, 2001.
[5] F. Ciravegna, Challenges in Information Extraction from Text for Knowledge Management in IEEE Intelligent Systems and Their Applications, November 2001, (Trend and Controversies).
[6] F. Ciravegna and D. Petrelli, User Involvement in Adaptive Information Extraction: Position Paper in Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining held in conjunction with the 17th International Conference on Artificial Intelligence (IJCAI-01), August, 2001.
[7] E. A. Feigenbaum, The art of artificial intelligence 1: Themes and case studies of knowledge engineering. Technical report, Pub. no. STAN-SC-77-621, Stanford University, Department of Computer Science, 1977.
[8] D. Fensel and E. Motta, Structured Development of Problem Solving Methods. Transactions on Knowledge and Data Engineering 13(6):913-932, 2001.
[9] T. R. Gruber, A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition 5(2), 199-220, 1993.
[10] S. Handschuh, S. Staab and A. Maedche, CREAM - Creating relational metadata with a component-based, ontology-driven annotation framework. First International Conference on Knowledge Capture (K-CAP 2001), Victoria B.C., October 2001.
[11] P. Hayes, RDF Model Theory, W3C Working Draft, February 2002. URL: http://www.w3.org/TR/rdf-mt/.
[12] J. Heflin and J. Hendler, A Portrait of the Semantic Web in Action. IEEE Intelligent Systems, 16(2), 2001.
[13] J. Kahan, M. Koivunen, E. Prud’Hommeaux and R. Swick, Annotea: Open RDF Infrastructure for Shared Web Annotations. In Proc. of the WWW10 International Conference. Hong Kong, 2001.
[14] P. Kogut and W. Holmes, AeroDAML: Applying Information Extraction to Generate DAML Annotations from Web Pages. First International Conference on Knowledge Capture (K-CAP 2001). Workshop on Knowledge Markup and Semantic Annotation, Victoria, B.C., Canada, October 2001.
[15] N. Kushmerick, D. Weld and R. Doorenbos, Wrapper induction for information extraction, Proc. of 15th International Conference on Artificial Intelligence, IJCAI-97.
[16] O. Lassila and R. Swick, Resource Description Framework (RDF): Model and Syntax Specification. Recommendation, World Wide Web Consortium, 1999. URL: http://www.w3.org/TR/REC-rdf-syntax/.
[17] D. Maynard, V. Tablan, H. Cunningham, C. Saggion, K. Bontcheva and Y. Wilks, Architectural Elements of Language Engineering Robustness. Journal of Natural Language Engineering, Special Issue on Robust Methods in Analysis of Natural Language Data, forthcoming, 2002.
[18] S. McIlraith, T. C. Son and H. Zeng, Semantic Web Services, IEEE Intelligent Systems, Special Issue on the Semantic Web, Volume 16, No. 2, pp. 46-53, March/April, 2001.
[19] R. S. Michalski, I. Mozetic, J. Hong, H. Lavrack, The multi purpose incremental learning system AQ15 and its testing application to three medical domains, in Proceedings of the 5th National Conference on Artificial Intelligence, Philadelphia. Morgan Kaufmann publisher, 1986.
[20] E. Motta, Reusable Components for Knowledge Models. IOS Press, Amsterdam, 1999.
[21] E. Riloff, An Empirical Study of Automated Dictionary Construction for Information Extraction in Three Domains. The AI Journal, 85, 101-134, 1996.
[22] S. Staab, A. Mädche and S. Handschuh, An Annotation Framework for the Semantic Web. In: S. Ishizaki (ed.), Proc. of The First International Workshop on MultiMedia Annotation. January, 30 - 31, 2001. Tokyo, Japan.
[23] M. Vargas-Vera, J. Domingue, Y. Kalfoglou, E. Motta and S. Buckingham-Shum, Template-driven information extraction for populating ontologies. Proc. of the IJCAI'01 Workshop on Ontology Learning, Seattle, WA, USA 2001.
[24] M. Vargas-Vera, E. Motta, J. Domingue, S. Buckingham Shum and M. Lanzoni, Knowledge Extraction by using an Ontology-based Annotation Tool. First International Conference on Knowledge Capture (K-CAP 2001). Workshop on Knowledge Markup and Semantic Annotation, Victoria B.C., Canada, October 2001.