Ontology-Based Document Management

Jan Hreno and Robert Kende
Technical University of Kosice, Dept of Cybernetics and Artificial Intelligence

1 Abstract

This article demonstrates, in a real-world application, an approach to the problem of associating documents with a knowledge base. It is based on a combination of annotating documents with concepts from a knowledge base and grouping documents together into clusters. Our knowledge base is an ontology provided by a dedicated ontology server.

2 Introduction

The WWW is gradually becoming the most important communication medium. There are many reasons for this, but the fact is that most people access information on the Internet using web services. Usually, the WWW provides one-way communication from publisher to user. In this case we face the problem of a huge amount of unstructured information, in which it is not easy to find a relevant document. This is a well-known problem for which many techniques are being developed, such as intelligent search engines or the ambitious Semantic Web initiative.

However, the WWW can also be used successfully for two-way communication between two sides. Such communication involves discussion, polling, chat, predefined reports, questionnaires, query systems etc., and of course classical publishing. Here the problem of too much information arises again, but a new requirement appears in addition. We not only want to avoid getting lost in the available information space, but we also want the system to manage our communication, give advice, and select or notify the right agent (usually a person) on the other side, so that the communication is efficient. A user-friendly and intelligent communication environment is very important if we want people to visit our site regularly, or even to be able to use it at all.

Webocrat is a web-based system supporting direct participation of citizens in democratic processes, which is being developed within the Webocracy project. The project partners are University of Technology in Košice, Slovakia, University of Wolverhampton, UK, University of Essen, Germany, JUVIER s.r.o, Slovakia, CITEC Engineering Oy Ab, Finland, City Ward Tahanovce, Slovakia, City Ward Furca, Slovakia, Wolverhampton Metropolitan Borough Council, UK.

From the point of view of functionality, the system can be broken down into several parts and/or modules (Mach et al. 2001). They can be represented in a layered, sandwich-like structure, which is depicted in Figure 1.

Figure 1 System structure from the point of view of the system's functionality

The central part of this structure is occupied by a knowledge model (KM) module. This system component contains one or more ontological domain models providing a conceptual model of a domain. The purpose of this component is to index all information stored in the system in order to describe the context of this information (in terms of domain-specific concepts). The central position symbolises that the knowledge model is the core (heart) of the system: all parts of the system use this module in order to deal with information stored in the system (both for organising this information and for accessing it).

Information stored within the system has the form of documents of different types. Since three main document types are expected to be processed by the system, the document space can be divided into three subspaces: publishing space, discussion space, and opinion polling space. These areas contain published documents expected to be read by users, users' contributions to discussions on different topics of interest, and records of users' opinions about different issues, respectively.

Documents stored in these three document subspaces can be inter-connected with hypertext links. They can contain links to documents stored in the same subspace, to documents located in another subspace, and to documents outside the system. Thus, documents within the system are organised in a net-like structure. Moreover, documents located in these subspaces should contain links to elements of a domain model.

Since each document subspace requires a different way of manipulating documents, three system modules are dedicated to them. The web content management (WCM) module offers means to manage the publishing space. It makes it possible to prepare documents for publishing (e.g. to link them to elements of a domain model), to publish them, and to access them after they are published. The discussion space is managed by the discussion forum (DF) module. This module enables users to contribute to discussions they are interested in and/or to read contributions submitted by other users. The opinion polling room (OPR) module is a tool for performing opinion polling on different topics. Users can express their opinions in the form of polling, by selecting the alternatives they prefer.

In order to navigate among the information stored in the system in an easy and effective way, one more layer has been added to the system. This layer is focused on retrieving relevant information from the system in various ways. It is represented by two modules, each enabling easy access to the stored information in a different way. The citizens' information helpdesk (CIH) module is dedicated to search. It is a search engine based on the indexing of stored documents. Its purpose is to find all documents that match the user's requirements expressed in the form of a query.

The other module performing information retrieval is the Reporter (REP) module. This module is dedicated to providing information of two types. The first type is information in an aggregated form: the module makes it possible to define and generate different reports concerning information stored in the system. The other type is focused on providing particular documents, but unlike the CIH module it is oriented towards an off-line mode of operation. It monitors the content of the document space on behalf of the user, and if information the user may be interested in appears in the system, it sends him/her an alert.

The upper layer of the presented functional structure of the system is represented by a user interface. It integrates the functionality of all the modules accessible to a particular user into one coherent portal to the system and provides access to all functions of the system in a uniform way.

3 Using domain model in Webocrat

3.1 Annotation

To give a system some kind of intelligence, it must know the meaning of a document, that is, its semantics. Standard HTML pages contain almost unstructured information that is understandable only by humans, not by computers. There is no way to tell the computer that an article is about cars unless it contains the word "car" explicitly or semantic analysis is applied. The solution is to annotate the document. This means that explicit information about its meaning is attached to it, whether manually or automatically. Thus, the system can extract relevant information from every annotated document and use it in some intelligent task such as searching. The Semantic Web initiative is based on this method. It gives proposals and suggestions for annotating HTML pages using special meta-tags and XML. These tags carry implicit (tacit) information about the document, which is not visible to the end user; it is used only by the system. In knowledge engineering this information is called meta-knowledge. There are many ways to store meta-knowledge. It does not need to be in meta-tags (this is not technically possible with MS Word documents); it can be stored in special files or databases. Based on meta-knowledge one can perform intelligent retrieval, which gives more relevant results than pure full-text search.

Meta-knowledge can be of two types:

1. List of keywords or description in natural language. Here the document is enriched with some kind of thesaurus. Full-text search is also performed over this part, giving more precise results.
2. Link to a concept in a predefined vocabulary. This method assumes that there exists some vocabulary of terms or concepts used in the area of our interest. More about this in the next section.

In our work, we concentrated our effort on annotating an electronic document (in our case any document published in the WCM system) by linking it, together with other relevant documents, to relevant concepts from the knowledge base (in our case an ontology). The approach is based on grouping together relevant documents and concepts from the ontology. We call such a group of documents and concepts an Association. Every association has a name, a description, and some other attributes needed later for document retrieval. The basic idea is shown in Figure 2.

Figure 2 Basic idea of the associations (labels: Document space, Association, Knowledge base)

3.2 Domain model

The previous section mentioned the word vocabulary. In the simplest case it is just a list of terms, where each term has its own description, i.e. a thesaurus. Such a structure is not satisfactory for our purposes, because it does not reflect relations among the terms. What we want is a model of the real world or of a part of it. The part of the world we are interested in is called a domain and its model is called a domain model. A domain model is based on a conceptualisation. A conceptualisation is an abstract, simplified view of the world that we wish to represent for some purpose. It consists of concepts that represent the objects of our interest in the real world and of the relationships that hold among them. To represent the domain model formally we use an ontology. An ontology is an explicit specification of a conceptualisation [1].

A domain model allows the system to perform reasoning and thus to determine the relevance of a document not only on a lexical but also on a semantic basis. An example of a part of an ontology is shown in Figure 3.

Figure 3 A part of sample ontology

4 Using domain model in Webocrat

The main idea behind the whole Webocrat system is to work with documents of various types that are associated with a part of the domain model (the ontology). This way it is possible to annotate discussions, chats, reports, polls or ordinary WWW pages. By ordinary documents we mean all the documents that are published by a local authority, such as news, announcements, reports and other documents that could be interesting for the public. Before they are published, they are annotated, whether manually or semi-automatically. After that they are ready for intelligent retrieval. When accessing information, the user can formulate a query consisting of words for full-text search and of terms (concepts) used in the ontology. The use of concepts in the query ensures that the hidden meaning of the query is also taken into account. Formulating such a query also allows the user to define a personal profile of interest in terms of the ontology. Personalised reports and newsletters can then be generated automatically and sent to the user.

The described scenario assumes that the ontology covers all relevant parts of real life concerning the structure of public institutions, communal matters, ecology etc. Figure 3 shows a sample ontology about institutions. (This is only a testing example. Real-life ontologies are being developed at the time of writing this paper.)

We have thus shown how classical web content can be annotated for the aforementioned one-way communication. But knowledge about the semantics of a document can also play an active role during communication. Discussions are a typical example in Webocrat. We consider a discussion to be a thread of documents that are all annotated.

In order to make it possible to retrieve discussion contributions according to their content, links to elements of a domain model must be created when a new discussion is created. These elements represent the topics on which the discussion will be focused. Each contribution added to this discussion later will be linked automatically to the same elements of the domain model (contributions inherit links from their discussion).

In order to make it possible to organise contributions within the discussion not only according to the date and time of submission or the author, a contribution can be supplemented with a set of links. These links can be of two types: links to elements of a domain model and links to other contributions within the discussion. The former type of link makes it possible to define the content in more detail (not only in the sense that the contribution is about exactly the same issues as the discussion as a whole). This includes not only adding more links to the set of links inherited from the discussion definition but also reducing this inherited set. The latter type of link enables the user to indicate which existing contribution(s) he/she is responding to. In addition, a contribution to a discussion can be enriched with links to documents inside or outside the system, e.g. when the users (submitters) refer to those documents in their contributions.

In order to read particular contributions it is necessary to access them. The user has several ways to do this. First of all, he/she can choose from a list of all available discussions. An alternative is to use the links from contributions to elements of a domain model in order to create groups of contributions dealing with the same set of issues [2].

Using links to the ontology, the system can suggest a discussion on some topic when the user reads a document on that topic. Or, when the user contributes to some discussion, the system can advise where to find more relevant information. This would be impossible without links to the domain model. Moreover, when the user links his contribution to some concepts, overriding the linkage of the whole discussion, the system can automatically find a more relevant discussion, if one exists, and suggest it. Similarly, if some contributions drift further and further from the topic of the original thread, the administrator can be notified to split the discussion. The similarity of contributions is measured using the distances between the corresponding concepts in the ontology.

With this discussion example we have shown how the domain model can enhance communication and how classical tools can be used more efficiently.

5 Domain model requirements

Using experience from other projects and related work with ontologies, we specified some basic attributes that we expect our ontology to have. They were as follows:

- some constant types are defined, e.g. integer, float, string, date, currency
- basic objects are classes, instances, relations
- classes can be primitive (the definition represents necessary but not sufficient conditions) or non-primitive (both sufficient and necessary)
- a class can be associated with a collection of slots
- slots with predefined semantics: documentation
- a collection of facets can be associated with a slot
- slot facets with predefined semantics (for classes only): value-type (can be a constant type, a constant expression (and, or, not), or an enumerated type), min-cardinality, max-cardinality, range (can be a constant tuple or a list of constant tuples), (not) same value as another slot has, subset-of-values as another slot has, documentation, default value, value
- an instance can inherit a collection of slots
- only one facet can be associated with a slot of an instance:
- value and default value of a slot can be a constant or a set of constants
- relations can be n-ary for n=1,2,3,...
- relations are defined on basic objects
- relations can have defined attributes: inverse-relation (which relation is the inverse of this one), disjoint, covered, equivalent, transitive, symmetric, functional
- predefined relations are:
  - instance-of: between a class and an instance; semantics: inheritance of slots (values, facets)
  - type-of: the inverse relation to instance-of
  - subclass-of: between two classes; semantics: inheritance of slots (values, facets)
  - superclass-of: the inverse relation to subclass-of
- slot facet values are inherited but can be overwritten (the new value must be more constraining than the old one)
- multiple inheritance (from more than one parent) is allowed
- special classes:
  - THING: represents the root of the class structure
    - every defined class is a subclass of THING
    - every instance is an instance of THING
    - has slot "documentation" with value-type STRING
  - CLASS: class of all classes
  - INSTANCES: class of all instances

At the current stage of the project we needed to offer our partners a tool for creating and editing ontologies. Because the Knowledge Module task starts later in our project, we specified some further requirements for the knowledge editor:

- it has to be flexible, to enable later modifications of the knowledge model
- platform independence
- it should enable importing ontologies from other formats

Thus we decided to use an Open Source knowledge editor programmed in Java and to modify it for our purposes, instead of programming a new one. The tool that best fitted almost all of our requirements seemed to be Protégé 2000 from Stanford University. The other knowledge editors we tested were OntoEdit, JOT, GEF, Apollo and SiLRI.

6 Using Protégé 2000 for creating ontologies

Protégé-2000 is the latest component-based and platform-independent generation of the ontology editor. Two goals have driven the design and development of Protégé-2000:

1. achieving interoperability with other knowledge-representation systems, and
2. being an easy-to-use and configurable knowledge-acquisition tool.

The first goal is achieved by compatibility of the knowledge model of Protégé-2000 with OKBC (Open Knowledge Base Connectivity). As a result, Protégé-2000 users can import ontologies from other OKBC-compatible servers and export their ontologies to other OKBC knowledge servers. Protégé-2000 uses the freedom allowed by the OKBC specification to maintain the model of structured knowledge acquisition tools and to achieve the second design goal of being a usable and extensible tool.

Protégé met almost all of our requirements for the knowledge editor. The only noticeable difference was in the way relations are represented in Protégé. Because of the freedom of the ontology specification in the Protégé knowledge model, relations are not defined as basic objects [3]. We discuss later in this article how to overcome this shortcoming. The other modifications we made to Protégé were:

1. Localisation of Protégé into more languages (at this time it is localised into Slovak).
2. Adding the ability to view the class hierarchy graphically (Figure 4). This helps the user to browse the ontology easily in a graphical view. The graph layout is computed automatically or can be changed by the user.

Figure 4 Graphview tab for Protégé 2000

7 Representing relations in Protégé

Because relations are not basic Protégé objects, we have to model them. In the discussion within the Protégé community four possible solutions were proposed:

Option 1
We can use own slots. This is probably the easiest way to go, but it is also the most restrictive one. Here the relations are own slots on all subclasses of the class that first specified those slots. The values of the slots are the classes that they are related to in one way or another.
Advantages:
- Very easy to model.
- We already have all the interface and underlying structures in Protégé for this.
Problems:
- We cannot add additional information, such as orientation, in particular when the value of a slot is a list of classes and not a single class.

Option 2 (extension of Option 1)
Use facets on own slots (own slots on own slots) to specify orientation and other additional properties.
Problem:
- Too complicated: it is hard even to explain exactly how things are going to work.

Option 3
Use template slots. Since slots are first-class objects in Protégé (they are themselves frames), it is easy to express attributes of relations such as reflexivity, transitivity, etc., as well as a hierarchy of relations (the same is true for Option 1).
Advantages:
- Can use the advantages of inheritance more extensively.
- Own slots on classes are harder to explain and understand; template slots are easier.
Problems:
- It is harder to express additional constraints on relations, such as orientation.

Option 4
Relations are themselves classes. We can go one step further and reify relations as classes themselves. Relations between particular classes are instances of these Relation classes.
Advantages:
- Can easily encode meta-information on relations: Reflexive, Transitive, Inverse. All of these properties are own slots on a Relation class.
- Relations can have additional slots, such as orientation, that get instantiated when we define relations between classes.
The first advantage also carries over to most of the earlier options, with the exception that the additional information (relation attributes, hierarchy) would be on slots and not classes, which is often harder to understand and manipulate.
Problem:
- Specialized browsing that "jumps over" a level to view hierarchies of entities based on each relation will be needed (for example, to view the part-of hierarchy).

All four options can be combined. The price for this is the loss of a uniform approach to describing properties of relations such as transitivity, inverses and so on.

Option 4 looks like the most suitable one, but it would be uncomfortable for the user to define a special class for every possible type of relation. Since real applications have not been developed yet, we cannot predict the number of relations needed. We decided on Option 3. The EXTENDED_SLOT class has been defined with the new facets TRANSITIVE and DISJOINT. Other attributes can easily be added at any time. This EXTENDED_SLOT class is set as the default, so that every new slot created on any class is a subclass of EXTENDED_SLOT and thus automatically contains the required attributes TRANSITIVE and DISJOINT. A relation between two objects is modelled as a slot, where one class of the relation contains that slot and the second class is a value of that slot.

Protégé 2000 does not treat the DISJOINT or TRANSITIVE facets in any special way. They are used only by the reasoning mechanism, which will be developed later and will not be a part of Protégé itself.

8 Acknowledgements

This work is done within the Webocracy project "Web Technologies Supporting Direct Participation in Democratic Processes", which is supported by European Commission DG INFSO under the IST program, contract no. IST-1999-20364, and within the VEGA project 1/8131/01 "Knowledge Technologies for Information Acquisition and Retrieval" of the Scientific Grant Agency of the Ministry of Education of the Slovak Republic.

The content of this publication is the sole responsibility of the authors, and in no way represents the view of the European Commission or its services.

9 References

[1] Gruber, T. R. (1993): A translation approach to portable ontologies. Knowledge Acquisition, 5(2):199-220.
[2] Mach, M.; Dridi, F.; Furdik, K. (2001): Webocrat System Architecture and Functionality. Webocracy report 2.4.
[3] Noy, N. F.; Fergerson, R. W.; Musen, M. A. (2000): The knowledge model of Protégé-2000: combining interoperability and flexibility. International Conference on Knowledge Engineering and Knowledge Management (EKAW '2000), Juan-les-Pins, France.
[4] Sabol, T.; Jackson, M.; Dridi, F.; Palola, I.; Novacek, E.; Cizmarik, T.; Thompson, P. (2001): Dissemination and Use Plan. Webocracy report 15.2.1.
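
As an illustration of the slot-based relation model chosen in Section 7 (Option 3), the following minimal Python sketch shows a relation stored as a slot on one class with the related class as its value, and a separate reasoning step that interprets the TRANSITIVE facet. It does not use the Protégé API. The names ExtendedSlot, OntologyClass, relate and related, and the classes Department, CityOffice and CityWard, are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ExtendedSlot:
    """A slot modelling a relation (hypothetical stand-in for EXTENDED_SLOT)."""
    name: str
    transitive: bool = False  # TRANSITIVE facet
    disjoint: bool = False    # DISJOINT facet (stored only, not interpreted here)


@dataclass
class OntologyClass:
    name: str
    # slot name -> names of the classes that are values of that slot
    slot_values: dict = field(default_factory=dict)


def relate(source, slot, target):
    """Record 'source --slot--> target': the slot sits on source, target is its value."""
    source.slot_values.setdefault(slot.name, []).append(target.name)


def related(classes, slot, start):
    """Class names related to `start` via `slot`; chains are followed only if TRANSITIVE."""
    direct = list(classes[start].slot_values.get(slot.name, []))
    if not slot.transitive:
        return set(direct)
    seen, stack = set(), direct
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(classes[name].slot_values.get(slot.name, []))
    return seen


# Hypothetical example data: a small part-of chain between institutions.
part_of = ExtendedSlot("part-of", transitive=True)
classes = {n: OntologyClass(n) for n in ("Department", "CityOffice", "CityWard")}
relate(classes["Department"], part_of, classes["CityOffice"])
relate(classes["CityOffice"], part_of, classes["CityWard"])
```

Here the facet is only a stored flag on the slot, and the function related plays the role of an external reasoner, which mirrors the statement that the editor itself does not treat TRANSITIVE or DISJOINT in any special way.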