MnM: Ontology-Driven Tool for Semantic Markup

Maria Vargas-Vera1 and Enrico Motta1 and John Domingue1 and Mattia Lanzoni1 and Arthur Stutt1 and Fabio Ciravegna2

1 The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK. Email: {m.vargas-vera; e.motta; j.b.domingue; m.lanzoni; a.stutt}@open.ac.uk
2 Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK. Email: f.ciravegna@dcs.shef.ac.uk

Abstract. An important precondition for realising the goal of a semantic web is the ability to annotate web resources with semantic information. In order to carry out this task, users need appropriate representation languages, ontologies, and support tools. In this paper we present MnM, an annotation tool which provides both automated and semi-automated support for annotating web pages with semantic contents. MnM integrates a web browser with an ontology editor and provides open APIs to link to ontology servers and for integrating information extraction tools. MnM can be seen as an early example of the next generation of ontology editors, being web-based, oriented to semantic markup and providing mechanisms for large-scale automatic markup of web pages.

1 INTRODUCTION

An important pre-condition for realising the goal of the semantic web is the ability to annotate web resources with semantic information. In order to carry out this task, users need appropriate knowledge representation languages, ontologies, and support tools. The knowledge representation language provides the semantic interlingua for expressing knowledge precisely. RDF ([11] and [16]) and RDFS ([2]) provide the basic framework for expressing metadata on the web, while current developments in web-based knowledge representation, such as DAML+OIL (DAML+OIL, 2001) and WebOnt (http://www.w3.org), are building on the RDF base framework to provide more sophisticated knowledge representation support. Ontologies ([9]) provide the mechanism to support interoperability at a conceptual level. In a nutshell, the idea of interoperating agents able to exchange information and carry out complex problem solving on the web is based on the assumption that these agents will share common, explicitly defined, generic conceptualizations. These are typically models of a particular area, such as product catalogues or taxonomies of medical conditions, although ontologies can also be used to support the specification of reasoning services ([18], [20] and [8]), thus allowing not only 'static' interoperability through shared domain conceptualizations, but also 'dynamic' interoperability through the explicit publication of competence specifications, which can be reasoned about to determine whether a particular semantic web service is appropriate for a particular task.

Ontologies and representation languages provide the basic semantic tools to construct the semantic web. Obviously a lot more is needed; in particular, tool support is needed to facilitate the development of semantic resources, given a particular ontology and representation language.

This problem is not a new one: knowledge engineers realised early on that one of the main obstacles to the development of intelligent, knowledge-based systems was the so-called knowledge acquisition bottleneck ([7]). In a nutshell, the problem is how to acquire and represent knowledge, so that this knowledge can be effectively used by a reasoning system. Although the problem is not a new one, the context provided by the semantic web introduces new aspects to the problem, with respect to the nature of the knowledge and the type of users.

Nature of the knowledge. Traditional knowledge acquisition was concerned with knowledge for problem solving. Semantic markup will primarily focus on ontology population, a far easier knowledge acquisition task.

Type of users. Knowledge-based systems are normally written by skilled knowledge engineers. On the web, it is likely that semantic marking up will become a common activity, carried out by content providers who are not necessarily skilled knowledge engineers. This means that more emphasis will have to be put on facilitating semantic markup by 'ordinary' web users (people who are neither experts in language technologies nor 'power knowledge engineers'). In particular, automated knowledge extraction technologies are likely to play an increasingly important role, as a crucial technology to tackle the semantic web version of the knowledge acquisition bottleneck.

In this paper we present MnM, an annotation tool which provides both automated and semi-automated support for marking up web pages with semantic contents. MnM integrates a web browser with an ontology editor and provides open APIs to link to ontology servers and for integrating information extraction tools. MnM can be seen as an early example of the next generation of ontology editors, being web-based, oriented to semantic markup and providing mechanisms for large-scale automatic markup of web pages.

The rest of the paper is organised as follows: in the next section we will show the process model underlying the design of the tool. Finally, sections 3 and 4 discuss related work and re-state the main tenets and results from our research.

2 PROCESS MODEL

Within this work we have focused on creating a generic process model for developing semantically enriched web content. The component tools which are used in MnM are ontology servers, Information Extraction (IE) tools and augmented web browsers. During our initial work in this area we found that either the existing tools did not directly support the creation of semantic web content or the mapping between the tasks to be carried out and the toolset was non-trivial. Hence, within MnM, we adopted a generic process model, which can be easily understood by web developers who are not necessarily expert ontology engineers or human language technology experts. Another key feature of our process model is that it is generic with respect to the specific ontology server and IE technologies used.

There are five main activities supported by MnM:

• Browse. A specific set of knowledge components is chosen from a library of knowledge models on an ontology server.
• Markup. The chosen set of knowledge components is selected to form the basis of an IE mechanism. A corpus of documents is manually marked up.
• Learn. A learning algorithm is run over the marked up corpus to learn the extraction rules.
• Test. The IE mechanism is run over a test corpus to assess its precision and recall measures.
• Extract. An IE mechanism is selected and run over a set of documents.

We will now provide more details of each of the above activities in turn.

Browse

In this activity the user browses a library of knowledge models which sit on a web-based ontology server. The user can see an overview of the existing models and can select which one to focus on (i.e., which ontology to use to initiate the markup process). Within a selected ontology the user can browse the existing items, for example the classes and instances. Items within an ontology can be selected as the starting point for selecting an IE mechanism. More specifically, the selected class forms the basis for a template which will eventually be matched against a corpus of documents and instantiated in the extraction activity.

Mark-Up

Once a class has been selected, a training corpus of manually marked up pages needs to be created. Here the user views appropriate documents within MnM's built-in web browser and annotates segments of text using tags based on the class's slots as given in the ontology (i.e., ontology-driven mark-up). As the text is selected, MnM inserts the relevant XML tags into the document.

Learning

MnM integrates web browsing, ontology browsing and IE development. It does not have a built-in IE tool but provides a plug-in interface which allows IE tools to be integrated easily.

In a previous version of MnM we integrated Marmot, Badger and Crystal from the University of Massachusetts ([21]) and our own NLP components (i.e., the OCML preprocessor). A full description of this version can be found in ([23] and [24]). However, in this paper we will concentrate on the recent integration work that we have carried out with Amilcare, a tool for adaptive information extraction ([3]).

Amilcare is designed to support active annotation of documents. It performs IE by enriching texts with XML annotations. To use Amilcare in a new domain the user simply has to manually annotate a training set of documents. No knowledge of Natural Language Technologies is necessary.

Amilcare is designed to accommodate the needs of different user types. While naïve users can build new applications without delving into the complexity of Human Language Technology, IE experts are provided with a number of facilities for tuning the final application. Induced rules can be inspected, monitored and edited to obtain some additional accuracy, if required. The interface also allows precision (P) and recall (R) to be balanced. The system can be run on an annotated unseen corpus and users are presented with statistics on accuracy, together with details on correct matches and mistakes. Retuning the P&R balance does not generally require major retraining, and facilities for inspecting the effect of different P&R balances are provided. Although the current interface for balancing P&R is designed for IE experts, a future version will provide support for naïve users ([6]).

The output of the training phase is a collection of rules for IE that are associated with the specific scenario.
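As an illustration of the precision and recall statistics mentioned above, the following is a minimal sketch of how such figures could be computed by comparing the annotations produced by an IE tool with those of a manually annotated corpus. It is not Amilcare's scoring procedure, which is not described here: the representation of an annotation as a (tag, start, end) tuple, the exact-match criterion and the function names are all assumptions made for the example.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of annotations.

    Each annotation is assumed to be a hashable tuple such as
    (tag, start_offset, end_offset); only exact matches count as correct.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of P and R.

    beta > 1 weights recall more heavily, beta < 1 weights precision more.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical data: manual annotations versus automatically inserted ones.
gold = {("visitor", 0, 10), ("place", 25, 40), ("date", 50, 60)}
predicted = {("visitor", 0, 10), ("place", 26, 40)}

p, r = precision_recall(predicted, gold)
print(f"P={p:.2f} R={r:.2f} F1={f_measure(p, r):.2f}")
```

In this toy case one of the two predicted annotations matches exactly, so precision is higher than recall; a tool tuned towards recall would accept more candidate annotations at the cost of more mistakes.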
At the start of the learning phase Amilcare preprocesses texts using Annie, the shallow IE system included in the Gate package ([17], www.gate.ac.uk). Annie performs text tokenization (segmenting texts into words), sentence splitting (identifying sentences), part of speech tagging (lexical disambiguation), gazetteer lookup (dictionary lookup) and named entity recognition (recognition of people and organization names, dates, etc.).

Amilcare then induces rules for information extraction. The learning system is based on LP2, a covering algorithm for supervised learning of IE rules based on Lazy-NLP ([3] and [4]). This is a wrapper induction methodology ([15]) that, unlike other wrapper induction approaches, uses linguistic information in the rule generalization process. The learning system starts by inducing wrapper-like rules that make no use of linguistic information, where rules are sets of conjunctive conditions on adjacent words. Then the linguistic information provided by Annie is used in order to create generalised rules: conditions on words are substituted with conditions on the linguistic information (e.g. conditions matching on either the lexical category, or the class provided by the gazetteer, etc.) ([4]).

All the generalizations are tested in parallel by using a variant of the AQ algorithm ([19]) and the best generalizations are kept for IE. The idea is that the linguistic-based generalisation is deployed only when the use of NLP information is reliable or effective. The measure of reliability here is not linguistic correctness, but effectiveness in extracting information using linguistic information as opposed to using shallower approaches. Lazy NLP-based systems learn which is the best strategy for each information/context separately. For example, they may decide that using the result of a part of speech tagger is the best strategy for recognising the speaker in seminar announcements, but not for spotting the seminar location. This strategy is quite effective for analysing documents with mixed genres, a common situation in web documents ([5]).

The learning system induces two types of rules: tagging rules and correction rules. A tagging rule is composed of a left hand side, containing a pattern of conditions on a connected sequence of words, and a right hand side that is an action inserting an XML tag in the texts. Correction rules shift misplaced annotations (inserted by tagging rules) to the correct position. These are learnt from the errors found whilst attempting to re-annotate the training corpus using the induced tagging rules. Correction rules are identical to tagging rules, but (1) their patterns also match the tags inserted by the tagging rules and (2) their actions shift misplaced tags rather than adding new ones.

Testing

MnM provides various mechanisms for selecting a test corpus and distinguishing it from a training corpus. The user can manually select training and test corpora, and these can be in the form of local files or on the web. In addition, it is also possible to simply select a corpus (either locally or on the web) and let the system create the test and training corpora randomly.

Extraction

After the training phase Amilcare has a library of induced rules which can be used to extract information from texts. When working in extraction mode, Amilcare receives as input a (collection of) text(s) with the associated scenario, including the rules induced during the training phase (the scenario is the set of tags that the user inserts in the training corpora). It preprocesses the text(s) by using Annie and then it applies its rules and returns the original text with the added annotations. The Gate annotation schema is used for annotation ([17]).

Once that is done, the information extracted is presented to the user for approval. Then the extracted information is sent to the ontology server, which will populate the selected ontology.

During the population step the IE mechanism fills predefined slots associated with an extraction template. Each template consists of slots of a particular class as defined in the selected ontology; for instance, the class visiting-a-place-or-people has the slots visitor, place, etc. Our goal is to automatically fill as many slots as possible. However, some of the slots may still require manual intervention. There are several reasons for this problem:

• there is information that is not contained in the text,
• none of the rules from our IE libraries match with the sentence that might provide the information (incomplete set of rules). This means that the learning phase needs to be tuned.

The extracted information is also validated using the ontology. This is possible because each slot in each class of the ontology has a type associated with it. Therefore, extracted information which does not match the type definition of the slot in the ontology can be highlighted as incorrect.

3 RELATED WORK

A number of annotation tools for producing semantic markup exist. The most interesting of these are Annotea ([13]); SHOE Knowledge Annotator ([12]); the COHSE annotator ([1]); AeroDAML ([14]); and OntoMat, a tool being developed using the CREAM annotation framework ([10]). A commercial version of OntoMat is available as OntoAnnotate (http://www.ontoprise.de/com/co_produ_tool2.htm).

Annotea provides RDF-based markup but it does not support information extraction, nor is it linked to an ontology server. It does, however, have an annotation server which makes annotations publicly available. SHOE Knowledge Annotator allows users to mark up pages in SHOE guided by ontologies available locally or via a URL. These marked up pages can be reasoned about by SHOE-aware tools such as SHOE Search. The COHSE annotator uses an ontology server to mark up pages in DAML+OIL. The results can be saved as RDF. AeroDAML is available as a web page. The user simply enters a URL and the system automatically returns DAML annotations on a web page using a predefined ontology based on WordNet.

Of the systems listed above, OntoMat is closest to MnM both in spirit and in functionality. Both can provide some form of automated extraction. However, while MnM makes it possible to access ontology servers through APIs, such as OKBC, and also to access ontologies specified in a markup format, such as RDF and DAML+OIL, OntoMat only provides the latter functionality. In contrast with OntoMat, MnM can handle multiple ontologies at the same time, which makes it very easy to switch from one to another, and also allows inherited definitions to be displayed for ontology editing and browsing. On the other hand, OntoMat can store pages annotated in DAML+OIL using OntoBroker as an annotation server. It also provides crawlers which can search the Web for marked up pages for addition to its internal knowledge base.

While MnM and OntoMat are very similar, they illustrate a slight difference of emphasis in providing tools for the Semantic Web. While OntoMat adopts the philosophy that the markup which indicates the knowledge content of a web resource should be included as part of that resource, MnM's annotations are stored both as markup on a page and as items in a knowledge base held on the WebOnto combined ontology and knowledge base server.

4 CONCLUSIONS

In this paper we have described MnM, an ontology-based annotation tool which provides both automated and semi-automated support for annotating web pages with semantic contents. The first prototype of the system has now been completed and tested with both Amilcare and the UMass set of tools. The early results are encouraging in terms of the quality and robustness of our current implementation. However, there is clearly a lot more work needed to make this technology easy to use for our target user base (people who are neither experts in language technologies nor 'power knowledge engineers'). In particular, all the activities associated with automated markup tend to be very sensitive to the quality of markup and to the appropriateness of the chosen corpora. Amilcare already attempts to address some of these issues through its adaptive mechanisms, but more work is needed in this area. In addition, we also plan to do more work on the user interface, in particular with respect to the integration of markup, ontology browsing and the 'semantic navigation' of web pages. Currently, ontology and web browsing are integrated with respect to contents annotation, but ontologies do not inform the web browsing component of MnM directly. Our vision for the semantic web is one in which new forms of 'conceptual navigation' will emerge, where association between resources will be semantic as well as hypertextual. We plan to experiment with these ideas and extend the interface of MnM to support novel, markup-driven forms of web browsing, as well as the standard HTML-based ones.

ACKNOWLEDGEMENTS

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), which is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. The AKT IRC comprises the Universities of Aberdeen, Edinburgh, Sheffield, Southampton and the Open University. The authors would like to thank Maruf Hassan and Simon Buckingham Shum for their invaluable help in reviewing the first draft of this paper.

REFERENCES

[1] S. Bechhofer and C. Goble, Towards Annotation Using DAML+OIL. First International Conference on Knowledge Capture (K-CAP 2001). Workshop on Semantic Markup and Annotation. Victoria, B.C., Canada, October 2001.
[2] D. Brickley and R. Guha, Resource Description Framework (RDF) Schema Specification 1.0. Candidate recommendation, World Wide Web Consortium, 2000.
URL: http://www.w3.org/TR/2000/CR-rdf-schema-20000327.
[3] F. Ciravegna, Adaptive Information Extraction from Text by Rule Induction and Generalisation, Proc. of 17th International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle, August 2001.
[4] F. Ciravegna, LP2 an Adaptive Algorithm for Information Extraction from Web-related Texts. Proc. of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining held in conjunction with the 17th International Conference on Artificial Intelligence (IJCAI-01), August, 2001.
[5] F. Ciravegna, Challenges in Information Extraction from Text for Knowledge Management in IEEE Intelligent Systems and Their Applications, November 2001, (Trend and Controversies).
[6] F. Ciravegna and D. Petrelli, User Involvement in Adaptive Information Extraction: Position Paper in Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining held in conjunction with the 17th International Conference on Artificial Intelligence (IJCAI-01), August, 2001.
[7] E. A. Feigenbaum, The art of artificial intelligence 1: Themes and case studies of knowledge engineering. Technical report, Pub. no. STAN-SC-77-621, Stanford University, Department of Computer Science, 1977.
[8] D. Fensel and E. Motta, Structured Development of Problem Solving Methods. Transactions on Knowledge and Data Engineering 13(6):913-932, 2001.
[9] T. R. Gruber, A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition 5(2), 199-220, 1993.
[10] S. Handschuh, S. Staab and A. Maedche, CREAM: Creating relational metadata with a component-based, ontology-driven annotation framework. First International Conference on Knowledge Capture (K-CAP 2001), Victoria B.C., October 2001.
[11] P. Hayes, RDF Model Theory, W3C Working Draft, February 2002. URL: http://www.w3.org/TR/rdf-mt/.
[12] J. Heflin and J. Hendler, A Portrait of the Semantic Web in Action. IEEE Intelligent Systems, 16(2), 2001.
[13] J. Kahan, M. Koivunen, E. Prud'Hommeaux and R. Swick, Annotea: Open RDF Infrastructure for Shared Web Annotations. In Proc. of the WWW10 International Conference. Hong Kong, 2001.
[14] P. Kogut and W. Holmes, AeroDAML: Applying Information Extraction to Generate DAML Annotations from Web Pages. First International Conference on Knowledge Capture (K-CAP 2001). Workshop on Knowledge Markup and Semantic Annotation, Victoria, B.C., Canada, October 2001.
[15] N. Kushmerick, D. Weld and R. Doorenbos, Wrapper induction for information extraction, Proc. of 15th International Conference on Artificial Intelligence, IJCAI-97.
[16] O. Lassila and R. Swick, Resource Description Framework (RDF): Model and Syntax Specification. Recommendation, World Wide Web Consortium, 1999. URL: http://www.w3.org/TR/REC-rdf-syntax/.
[17] D. Maynard, V. Tablan, H. Cunningham, C. Saggion, K. Bontcheva and Y. Wilks, Architectural Elements of Language Engineering Robustness. Journal of Natural Language Engineering, Special Issue on Robust Methods in Analysis of Natural Language Data, forthcoming, 2002.
[18] S. McIlraith, T. C. Son and H. Zeng, Semantic Web Services, IEEE Intelligent Systems, Special Issue on the Semantic Web, Volume 16, No. 2, pp. 46-53, March/April, 2001.
[19] R. S. Michalski, I. Mozetic, J. Hong and N. Lavrac, The multi-purpose incremental learning system AQ15 and its testing application to three medical domains, in Proceedings of the 5th National Conference on Artificial Intelligence, Philadelphia. Morgan Kaufmann, 1986.
[20] E. Motta, Reusable Components for Knowledge Models. IOS Press, Amsterdam, 1999.
[21] E. Riloff, An Empirical Study of Automated Dictionary Construction for Information Extraction in Three Domains. The AI Journal, 85, 101-134, 1996.
[22] S. Staab, A. Mädche and S. Handschuh, An Annotation Framework for the Semantic Web. In: S. Ishizaki (ed.), Proc. of The First International Workshop on MultiMedia Annotation. January 30-31, 2001. Tokyo, Japan.
[23] M. Vargas-Vera, J. Domingue, Y. Kalfoglou, E. Motta and S. Buckingham-Shum, Template-driven information extraction for populating ontologies. Proc. of the IJCAI'01 Workshop on Ontology Learning, Seattle, WA, USA, 2001.
[24] M. Vargas-Vera, E. Motta, J. Domingue, S. Buckingham Shum and M. Lanzoni, Knowledge Extraction by using an Ontology-based Annotation Tool. First International Conference on Knowledge Capture (K-CAP 2001). Workshop on Knowledge Markup and Semantic Annotation, Victoria B.C., Canada, October 2001.
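
As a supplementary illustration of the ontology-based validation described under Extraction in Section 2, the following minimal sketch checks the values extracted for the slots of a template against the type declared for each slot. It is not MnM's implementation and does not use the WebOnto or OKBC APIs: the slot types, the instance table and the function name are hypothetical, and only the class name visiting-a-place-or-people and its slots visitor and place come from the paper.

```python
# Hypothetical slot types for the class visiting-a-place-or-people.
# The paper names the slots "visitor" and "place"; the types are assumed.
SLOT_TYPES = {"visitor": "person", "place": "location"}

# Hypothetical table mapping known instance names to their ontology class.
INSTANCE_CLASS = {
    "alice": "person",
    "sheffield": "location",
    "milton keynes": "location",
}


def validate_slots(extracted, slot_types, instance_class):
    """Classify every slot of a template as ok, type-mismatch or missing.

    `extracted` maps slot names to the strings returned by the IE step.
    A value is accepted only if the class recorded for it equals the
    type declared for the slot.
    """
    report = {}
    for slot, expected_type in slot_types.items():
        value = extracted.get(slot)
        if value is None:
            report[slot] = "missing"          # left for manual intervention
        elif instance_class.get(value.lower()) == expected_type:
            report[slot] = "ok"
        else:
            report[slot] = "type-mismatch"    # highlighted as incorrect
    return report


# The IE step wrongly filled "visitor" with a place name and found no place.
report = validate_slots({"visitor": "Sheffield"}, SLOT_TYPES, INSTANCE_CLASS)
print(report)
```

Slots reported as missing correspond to the cases listed in Section 2 where manual intervention is still required, while type mismatches are candidates for highlighting before the instance is sent to the ontology server.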