=Paper=
{{Paper
|id=Vol-100/paper-10
|storemode=property
|title=MnM: Ontology-Driven Tool for Semantic Markup
|pdfUrl=https://ceur-ws.org/Vol-100/Maria_Vargas-Vera-et-al.pdf
|volume=Vol-100
|dblpUrl=https://dblp.org/rec/conf/ecai/Vargas-VeraMDLS02
}}
==MnM: Ontology-Driven Tool for Semantic Markup==
Maria Vargas-Vera1 and Enrico Motta1 and John Domingue1 and Mattia Lanzoni1 and Arthur Stutt1 and
Fabio Ciravegna2
1 The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK. Email: {m.vargas-vera; e.motta; j.b.domingue; m.lanzoni; a.stutt}@open.ac.uk
2 Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK. Email: f.ciravegna@dcs.shef.ac.uk

Abstract. An important precondition for realising the goal of a semantic web is the ability to annotate web resources with semantic information. In order to carry out this task, users need appropriate representation languages, ontologies, and support tools. In this paper we present MnM, an annotation tool which provides both automated and semi-automated support for annotating web pages with semantic contents. MnM integrates a web browser with an ontology editor and provides open APIs to link to ontology servers and for integrating information extraction tools. MnM can be seen as an early example of the next generation of ontology editors, being web-based, oriented to semantic markup and providing mechanisms for large-scale automatic markup of web pages.

1 INTRODUCTION

An important pre-condition for realising the goal of the semantic web is the ability to annotate web resources with semantic information. In order to carry out this task, users need appropriate knowledge representation languages, ontologies, and support tools. The knowledge representation language provides the semantic interlingua for expressing knowledge precisely. RDF ([11] and [16]) and RDFS ([2]) provide the basic framework for expressing metadata on the web, while current developments in web-based knowledge representation, such as DAML+OIL (DAML+OIL, 2001) and WebOnt (http://www.w3.org), are building on the RDF base framework to provide more sophisticated knowledge representation support. Ontologies ([9]) provide the mechanism to support interoperability at a conceptual level. In a nutshell, the idea of interoperating agents able to exchange information and carry out complex problem solving on the web is based on the assumption that these agents will share common, explicitly defined, generic conceptualizations. These are typically models of a particular area, such as product catalogues or taxonomies of medical conditions, although ontologies can also be used to support the specification of reasoning services ([18], [20] and [8]), thus allowing not only ‘static’ interoperability through shared domain conceptualizations, but also ‘dynamic’ interoperability through the explicit publication of competence specifications, which can be reasoned about to determine whether a particular semantic web service is appropriate for a particular task.

Ontologies and representation languages provide the basic semantic tools to construct the semantic web. Obviously a lot more is needed; in particular, tool support is needed to facilitate the development of semantic resources, given a particular ontology and representation language. This problem is not a new one: knowledge engineers realised early on that one of the main obstacles to the development of intelligent, knowledge-based systems was the so-called knowledge acquisition bottleneck ([7]). In a nutshell, the problem is how to acquire and represent knowledge, so that this knowledge can be effectively used by a reasoning system. Although the problem is not a new one, the context provided by the semantic web introduces new aspects to the problem, with respect to the nature of the knowledge and the type of users.

Nature of the knowledge. Traditional knowledge acquisition was concerned with knowledge for problem solving. Semantic markup will primarily focus on ontology population, a far easier knowledge acquisition task.

Type of users. Knowledge-based systems are normally written by skilled knowledge engineers. On the web, it is likely that semantic marking up will become a common activity, carried out by content providers who are not necessarily skilled knowledge engineers. This means that
more emphasis will have to be put on facilitating semantic markup by ‘ordinary’ web users (people who are neither experts in language technologies nor 'power knowledge engineers'). In particular, automated knowledge extraction technologies are likely to play an ever more important role, as a crucial technology to tackle the semantic web version of the knowledge acquisition bottleneck.

In this paper we present MnM, an annotation tool which provides both automated and semi-automated support for marking up web pages with semantic contents. MnM integrates a web browser with an ontology editor and provides open APIs to link to ontology servers and for integrating information extraction tools. MnM can be seen as an early example of the next generation of ontology editors, being web-based, oriented to semantic markup and providing mechanisms for large-scale automatic markup of web pages.

The rest of the paper is organised as follows: in the next section we will show the process model underlying the design of the tool. Finally sections 3 and 4 discuss related work and re-state the main tenets and results from our research.

2 PROCESS MODEL

Within this work we have focused on creating a generic process model for developing semantically enriched web content. The component tools which are used in MnM are ontology servers, Information Extraction (IE) tools and augmented web browsers. During our initial work in this area we found that either the existing tools did not directly support the creation of semantic web content or the mapping between the tasks to be carried out and the toolset was non-trivial. Hence, within MnM, we adopted a generic process model, which can be easily understood by web developers who are not necessarily expert ontology engineers or human language technology experts. Another key feature of our process model is that it is generic with respect to the specific ontology server and IE technologies used.

There are five main activities supported by MnM:

• Browse. A specific set of knowledge components is chosen from a library of knowledge models on an ontology server.
• Markup. The chosen set of knowledge components is selected to form the basis of an IE mechanism. A corpus of documents is manually marked up.
• Learn. A learning algorithm is run over the marked up corpus to learn the extraction rules.
• Test. The IE mechanism is run over a test corpus to assess its precision and recall measures.
• Extract. An IE mechanism is selected and run over a set of documents.

We will now provide more details of each of the above activities in turn.

Browse

In this activity the user browses a library of knowledge models which sit on a web based ontology server. The user can see an overview of the existing models and can select which one to focus on (i.e., which ontology to use to initiate the markup process). Within a selected ontology the user can browse the existing items, for example the classes and instances. Items within an ontology can be selected as the starting point for selecting an IE mechanism. More specifically, the selected class forms the basis for a template which will eventually be matched against a corpus of documents and instantiated in the extraction activity.

Mark-Up

Once a class has been selected a training corpus of manually marked up pages needs to be created. Here the user views appropriate documents within MnM’s built-in web browser and annotates segments of text using the tags based on the class’s slots as given in the ontology (i.e., ontology driven mark-up). As the text is selected MnM inserts the relevant XML tags into the document.

Learning

MnM integrates web browsing, ontology browsing and IE development. It does not have a built-in IE tool but provides a plug-in interface which allows the integration of IE tools easily.

In a previous version of MnM we integrated Marmot, Badger and Crystal from the University of Massachusetts ([21]) and our own NLP components (i.e., OCML preprocessor). A full description of this version can be found in ([23] and [24]). However, in this paper we will concentrate on the recent integration work that we have carried out with Amilcare, a tool for adaptive information extraction ([3]).

Amilcare is designed to support active annotation of documents. It performs IE by enriching texts with XML annotations. To use Amilcare in a new domain the user simply has to manually annotate a training set of documents. No knowledge of Natural Language Technologies is necessary.

Amilcare is designed to accommodate the needs of different user types. While naïve users can build new applications without delving into the complexity of Human Language Technology, IE experts are provided with a number of facilities for tuning the final application. Induced rules can be inspected, monitored and edited to obtain some additional accuracy, if required. The interface also allows precision (P) and recall (R) to be balanced.
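
For readers unfamiliar with the two measures being balanced, the following minimal Python sketch computes precision and recall by comparing the annotations produced by an IE tool with those of a manually annotated corpus. It is an illustration only, not Amilcare's scoring code: the span representation (tag, start offset, end offset), the exact-match criterion and the example data are all assumptions.

```python
from typing import Set, Tuple

# Hypothetical span representation: (tag, start offset, end offset).
Span = Tuple[str, int, int]


def precision_recall(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall over annotation spans."""
    correct = len(predicted & gold)  # spans that agree on tag and both offsets
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented example: three manual annotations, two produced by the tool.
gold = {("visitor", 0, 10), ("place", 22, 37), ("date", 41, 52)}
predicted = {("visitor", 0, 10), ("place", 22, 30)}  # second span has a wrong end offset
p, r = precision_recall(predicted, gold)
```

A tool tuned towards precision would emit fewer, safer annotations (raising P and lowering R); one tuned towards recall would do the opposite.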
• Test. The IE mechanism is run over a test corpus The system can be run on an annotated unseen corpus and
to assess its precision and recall measures. users are presented with statistics on accuracy, together
with details on correct matches and mistakes. Retuning the new ones. The output of the training phase is a collection
P&R balance does not generally require major retraining, of rules for IE that are associated with the specific scenario.
facilities for inspecting the effect of different P&R balances
are provided. Although the current interface for balancing
P&R is designed for IE experts, a future version will Testing
provide support for naive users ([6]). MnM provides various mechanisms for selecting a test
corpus and distinguish this from a training corpora. The
At the start of the learning phase Amilcare preprocesses
user can manually select training and test corpora and these
texts using Annie, the shallow IE system included in the
can be in the form of local files or on the web. In addition,
Gate package ([17], www.gate.ac.uk). Annie performs text
it is also possible to simply select a corpus (either locally or
tokenization (segmenting texts into words), sentence
on the web) and let the system to create, to test and training
splitting (identifying sentences) part of speech tagging
corpora randomly.
(lexical disambiguation), gazetteer lookup (dictionary
lookup) and named entity recognition (recognition of Extraction
people and organization names, dates, etc.). After the training phase Amilcare has a library of induced
Amilcare then induces rules for information extraction. The rules which can be used to extract information from texts.
learning system is based on LP2, a covering algorithm for When working in extraction mode, Amilcare receives as
supervised learning of IE rules based on Lazy-NLP ([3] input a (collection of) text(s) with the associated scenario –
and [4]). This is a wrapper induction methodology ([15]) scenario is the set of tags that the user will insert in the
that, unlike other wrapper induction approaches, uses training corpora- (including the rules induced during the
linguistic information in the rule generalization process. training phase). It preprocesses the text(s) by using Annie
The learning system starts inducing wrapper-like rules that and then it applies its rules and returns the original text
make no use of linguistic information, where rules are sets with the added annotations. The Gate annotation schema is
of conjunctive conditions on adjacent words. Then the used for annotation ([17]).
linguistic information provided by Annie is used in order to Once that is done the information extracted is presented to
create generalised rules: conditions on words are the user for approval. Then the extracted information is
substituted with conditions on the linguistic information sent to the ontology server which will populate the selected
(e.g. condition matching on either the lexical category, or ontology.
the class provided by the gazetteer, etc. ([4]).
All the generalizations are tested in parallel by using a During the population step the IE mechanism fills
variant of the AQ algorithm ([19]) and the best predefined slots associated with an extraction template.
generalizations are kept for IE. The idea is that the Each template consists of slots of a particular class as
linguistic-based generalisation is deployed only when the defined in the selected ontology, for instance, the class
use of NLP information is reliable or effective. The visiting-a-place-or-people has the slots: visitor, place, etc.
measure of reliability here is not linguistic correctness, but Our goal is to automatically fill as many slots as
effectiveness in extracting information using linguistic possible. However, some of the slots may still require
information as opposed to using shallower approaches. manual intervention. There are several reasons for this
Lazy NLP-based systems learn which is the best strategy problem:
for each information/context separately. For example they • there is information that is not contained in the text,
may decide that using the result of a part of speech tagger • none of the rules from our IE libraries match with
is the best strategy for recognising the speaker in seminar the sentence that might provide the information
announcements, but not to spot the seminar location. This (incomplete set of rules). This means that the learning
strategy is quite effective for analysing documents with phase needs to be tuned.
mixed genres, a common situation in web documents ([5]). The extracted information is also validated using the
ontology. This is possible because each slot in each class
The learning system induces two types of rules: tagging
of the ontology has a type associated with it. Therefore,
rules and correction rules. A tagging rule is composed of a
extracted information which does not match the type
left hand side, containing a pattern of conditions on a
definition of the slot in the ontology can be highlighted as
connected sequence of words, and a right hand side that is
incorrect.
an action inserting an XML tag in the texts. Correction
rules shift misplaced annotations (inserted by tagging rules)
to the correct position. These are learnt from the errors
found whilst attempting to re-annotate the training corpus
using the induced tagging rules.
Correction rules are identical to tagging rules, but (1) their
patterns also match the tags inserted by the tagging rules
and (2) their actions shift misplaced tags rather than adding
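
The type check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MnM's implementation: the slot types given for visiting-a-place-or-people, the instance sets standing in for an ontology server lookup, and the function name validate_template are all hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical slot-type definitions for one ontology class.
SLOT_TYPES: Dict[str, str] = {"visitor": "person", "place": "organization"}

# Hypothetical known instances per type, standing in for an ontology server lookup.
INSTANCES: Dict[str, Set[str]] = {
    "person": {"John Domingue", "Enrico Motta"},
    "organization": {"The Open University", "University of Sheffield"},
}


def validate_template(filled: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Sort slots into valid, incorrect (type mismatch) and missing (unfilled)."""
    report: Dict[str, List[str]] = {"valid": [], "incorrect": [], "missing": []}
    for slot, slot_type in SLOT_TYPES.items():
        value = filled.get(slot)
        if value is None:
            report["missing"].append(slot)  # left for manual intervention
        elif value in INSTANCES[slot_type]:
            report["valid"].append(slot)
        else:
            report["incorrect"].append(slot)  # to be highlighted to the user
    return report


# Invented extraction result: an organization was wrongly placed in the visitor slot.
report = validate_template(
    {"visitor": "The Open University", "place": "University of Sheffield"}
)
```

In this sketch the visitor slot is reported as incorrect because its value is not a known person, while a slot with no extracted value would be reported as missing and left to the user.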
3 RELATED WORK

A number of annotation tools for producing semantic markup exist. The most interesting of these are Annotea ([13]); SHOE Knowledge Annotator ([12]); the COHSE annotator ([1]); AeroDAML ([14]); and OntoMat, a tool being developed using the CREAM annotation framework ([10]). A commercial version of OntoMat is available as OntoAnnotate (http://www.ontoprise.de/com/co_produ_tool2.htm).

Annotea provides RDF-based markup but it does not support information extraction nor is it linked to an ontology server. It does, however, have an annotation server which makes annotations publicly available. SHOE Knowledge Annotator allows users to mark up pages in SHOE guided by ontologies available locally or via a URL. These marked up pages can be reasoned about by SHOE-aware tools such as SHOE Search. The COHSE annotator uses an ontology server to mark up pages in DAML+OIL. The results can be saved as RDF. AeroDAML is available as a web page. The user simply enters a URL and the system automatically returns DAML annotations on a web page using a predefined ontology based on WordNet.

Of the systems listed above, OntoMat is closest to MnM both in spirit and in functionality. Both can provide some form of automated extraction. However, while MnM makes it possible to access ontology servers through APIs, such as OKBC, and also to access ontologies specified in a markup format, such as RDF and DAML+OIL, OntoMat only provides the latter functionality. In contrast with OntoMat, MnM can handle multiple ontologies at the same time, which makes it very easy to switch from one to another, and also allows inherited definitions to be displayed for ontology editing and browsing. On the other hand, OntoMat can store pages annotated in DAML+OIL using OntoBroker as an annotation server. It also provides crawlers which can search the Web for marked up pages for addition to its internal knowledge base.

While MnM and OntoMat are very similar, they illustrate a slight difference of emphasis in providing tools for the Semantic Web. While OntoMat adopts the philosophy that the markup which indicates the knowledge content of a web resource should be included as part of that resource, MnM’s annotations are stored both as markup on a page and as items in a knowledge base held on the WebOnto combined ontology and knowledge base server.

4 CONCLUSIONS

In this paper we have described MnM, an ontology-based annotation tool which provides both automated and semi-automated support for annotating web pages with semantic contents. The first prototype of the system has now been completed and tested with both Amilcare and the UMass set of tools. The early results are encouraging in terms of the quality and robustness of our current implementation; however, there is clearly a lot more work needed to make this technology easy to use for our target user base (people who are neither experts in language technologies nor 'power knowledge engineers'). In particular, all the activities associated with automated markup tend to be very sensitive to the quality of markup and to the appropriateness of the chosen corpora. Amilcare already attempts to address some of these issues through its adaptive mechanisms; however, more work is needed in this area. In addition, we also plan to do more work on the user interface, in particular with respect to the integration of markup, ontology browsing and the 'semantic navigation' of web pages. Currently, ontology and web browsing are integrated with respect to contents annotation, but ontologies do not inform the web browsing component of MnM directly. Our vision for the semantic web is one in which new forms of 'conceptual navigation' will emerge, where association between resources will be semantic as well as hypertextual. We plan to experiment with these ideas and extend the interface of MnM to support novel, markup-driven forms of web browsing, as well as the standard HTML based ones.

ACKNOWLEDGEMENTS

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), which is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. The AKT IRC comprises the Universities of Aberdeen, Edinburgh, Sheffield, Southampton and the Open University. The authors would like to thank Maruf Hassan and Simon Buckingham Shum for their invaluable help in reviewing the first draft of this paper.

REFERENCES

[1] S. Bechhofer and C. Goble, Towards Annotation Using DAML+OIL. First International Conference on Knowledge Capture (K-CAP 2001). Workshop on Semantic Markup and Annotation. Victoria, B.C., Canada, October 2001.
[2] D. Brickley and R. Guha, Resource Description Framework (RDF) Schema Specification 1.0. Candidate recommendation,
World Wide Web Consortium, 2000. URL: http://www.w3.org/TR/2000/CR-rdf-schema-20000327.
[3] F. Ciravegna, Adaptive Information Extraction from Text by Rule Induction and Generalisation, Proc. of 17th International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle, August 2001.
[4] F. Ciravegna, LP2 an Adaptive Algorithm for Information Extraction from Web-related Texts. Proc. of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining held in conjunction with the 17th International Conference on Artificial Intelligence (IJCAI-01), August, 2001.
[5] F. Ciravegna, Challenges in Information Extraction from Text for Knowledge Management in IEEE Intelligent Systems and Their Applications, November 2001, (Trend and Controversies).
[6] F. Ciravegna and D. Petrelli, User Involvement in Adaptive Information Extraction: Position Paper in Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining held in conjunction with the 17th International Conference on Artificial Intelligence (IJCAI-01), August, 2001.
[7] E. A. Feigenbaum, The art of artificial intelligence 1: Themes and case studies of knowledge engineering. Technical report, Pub. no. STAN-CS-77-621, Stanford University, Department of Computer Science, 1977.
[8] D. Fensel and E. Motta, Structured Development of Problem Solving Methods. Transactions on Knowledge and Data Engineering 13(6):913-932, 2001.
[9] T. R. Gruber, A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition 5(2), 199-220, 1993.
[10] S. Handschuh, S. Staab and A. Maedche, CREAM: Creating relational metadata with a component-based, ontology-driven annotation framework. First International Conference on Knowledge Capture (K-CAP 2001), Victoria B.C., October 2001.
[11] P. Hayes, RDF Model Theory, W3C Working Draft, February 2002. URL: http://www.w3.org/TR/rdf-mt/.
[12] J. Heflin and J. Hendler, A Portrait of the Semantic Web in Action. IEEE Intelligent Systems, 16(2), 2001.
[13] J. Kahan, M. Koivunen, E. Prud’Hommeaux and R. Swick, Annotea: Open RDF Infrastructure for Shared Web Annotations. In Proc. of the WWW10 International Conference. Hong Kong, 2001.
[14] P. Kogut and W. Holmes, AeroDAML: Applying Information Extraction to Generate DAML Annotations from Web Pages. First International Conference on Knowledge Capture (K-CAP 2001). Workshop on Knowledge Markup and Semantic Annotation, Victoria, B.C., Canada, October 2001.
[15] N. Kushmerick, D. Weld and R. Doorenbos, Wrapper induction for information extraction, Proc. of 15th International Conference on Artificial Intelligence, IJCAI-97.
[16] O. Lassila and R. Swick, Resource Description Framework (RDF): Model and Syntax Specification. Recommendation, World Wide Web Consortium, 1999. URL: http://www.w3.org/TR/REC-rdf-syntax/.
[17] D. Maynard, V. Tablan, H. Cunningham, C. Saggion, K. Bontcheva and Y. Wilks, Architectural Elements of Language Engineering Robustness. Journal of Natural Language Engineering, Special Issue on Robust Methods in Analysis of Natural Language Data, forthcoming, 2002.
[18] S. McIlraith, T. C. Son and H. Zeng, Semantic Web Services, IEEE Intelligent Systems, Special Issue on the Semantic Web, Volume 16, No. 2, pp. 46-53, March/April, 2001.
[19] R. S. Michalski, I. Mozetic, J. Hong and N. Lavrac, The multi purpose incremental learning system AQ15 and its testing application to three medical domains, in Proceedings of the 5th National Conference on Artificial Intelligence, Philadelphia. Morgan Kaufmann publisher, 1986.
[20] E. Motta, Reusable Components for Knowledge Models. IOS Press, Amsterdam, 1999.
[21] E. Riloff, An Empirical Study of Automated Dictionary Construction for Information Extraction in Three Domains. The AI Journal, 85, 101-134, 1996.
[22] S. Staab, A. Mädche and S. Handschuh, An Annotation Framework for the Semantic Web. In: S. Ishizaki (ed.), Proc. of The First International Workshop on MultiMedia Annotation. January 30-31, 2001. Tokyo, Japan.
[23] M. Vargas-Vera, J. Domingue, Y. Kalfoglou, E. Motta and S. Buckingham-Shum, Template-driven information extraction for populating ontologies. Proc. of the IJCAI'01 Workshop on Ontology Learning, Seattle, WA, USA, 2001.
[24] M. Vargas-Vera, E. Motta, J. Domingue, S. Buckingham Shum and M. Lanzoni, Knowledge Extraction by using an Ontology-based Annotation Tool. First International Conference on Knowledge Capture (K-CAP 2001). Workshop on Knowledge Markup and Semantic Annotation, Victoria B.C., Canada, October 2001.