An Ontology based document management
                                              Jan Hreno1 and Robert Kende1


1     Abstract

This article demonstrates, in a real-world application, an
approach to the problem of associating documents with a
knowledge base. The approach combines annotating
documents with concepts from the knowledge base and
grouping documents together into clusters. Our knowledge
base is an ontology provided by a dedicated ontology
server.


2     Introduction

The WWW has gradually become the most important communication medium
in recent years. There are many reasons for this, but the fact is that
most people access information on the Internet using web services.
Usually, the WWW provides one-way communication from publisher to
user. In this case we face the problem of a huge amount of
unstructured information, in which it is not easy to find a relevant
document. This is a well-known problem for which many techniques are
being developed, such as intelligent search engines or the ambitious
Semantic Web initiative.

However, the WWW can also be used successfully for two-way
communication between two sides. Such communication involves
discussion, polling, chat, predefined reports, questionnaires, query
systems etc., and of course classical publishing. Here the problem of
too much information arises again, but a new requirement appears in
addition. We not only want to avoid getting lost in the available
information space, but also want the system to guide our
communication, give advice, and select or notify the right agent
(usually a person) on the other side, so that the communication is
efficient. A user-friendly and intelligent communication environment
is very important if we want people to visit our site regularly, or
even to be able to use it at all.

Webocrat is a web-based system supporting direct participation of
citizens in democratic processes, which is being developed within the
Webocracy project. The project partners are University of Technology
in Košice, Slovakia; University of Wolverhampton, UK; University of
Essen, Germany; JUVIER s.r.o., Slovakia; CITEC Engineering Oy Ab,
Finland; City Ward Tahanovce, Slovakia; City Ward Furca, Slovakia; and
Wolverhampton Metropolitan Borough Council, UK.

From the point of view of functionality, the system can be broken down
into several parts and/or modules (Mach et al 2001). They can be
represented in a layered sandwich-like structure, which is depicted in
Figure 1.

1 Technical University of Kosice, Dept of Cybernetics and Artificial
Intelligence

Figure 1 System structure from the point of view of the system’s
functionality

The central part of this structure is occupied by a knowledge model
(KM) module. This system component contains one or more ontological
domain models providing a conceptual model of a domain. The purpose of
this component is to index all information stored in the system in
order to describe the context of this information (in terms of
domain-specific concepts). The central position symbolises that the
knowledge model is the core (heart) of the system: all parts of the
system use this module in order to deal with information stored in the
system (both for organising this information and for accessing it).

Information stored within the system has the form of documents of
different types. Since three main document types are expected to be
processed by the system, the document space can be divided into three
subspaces: publishing space, discussion space, and opinion polling
space. These areas contain published documents expected to be read by
users, users’ contributions to discussions on different topics of
interest, and records of users’ opinions about different issues,
respectively.

Documents stored in these three document subspaces can be
inter-connected with hypertext links. They can contain links to other
documents: to documents stored in the same subspace, to documents
located in another subspace, and to documents outside the system.
Thus, documents within the system are organised in a net-like
structure. Moreover, documents located in these subspaces should
contain links to elements of a domain model.

Since each document subspace requires a different way of manipulating
documents, three system modules are dedicated to them. The web content
management (WCM) module offers means to manage the publishing space. It
enables documents to be prepared for publishing (e.g. linked to
elements of a domain model), published, and accessed after they are
published. The discussion space is managed by the discussion forum
(DF) module. This module enables users to contribute to discussions
they are interested in and/or to read contributions submitted by other
users. The opinion polling room (OPR) module is a tool for performing
opinion polling on different topics. Users express their opinions by
polling, that is, by selecting the alternatives they prefer.

In order to navigate among information stored in the system in an easy
and effective way, one more layer has been added to the system. This
layer is focused on retrieving relevant information from the system in
various ways. It is represented by two modules, each enabling easy
access to the stored information in a different way. The citizens’
information helpdesk (CIH) module is dedicated to search. It
represents a search engine based on the indexing of stored documents.
Its purpose is to find all those documents which match the user’s
requirements expressed in the form of a query.

The other module performing information retrieval is the Reporter
(REP) module. This module is dedicated to providing information of two
types. The first type represents information in an aggregated form: it
enables different reports concerning information stored in the system
to be defined and generated. The other type is focused on providing
particular documents, but unlike the CIH module it is oriented towards
an off-line mode of operation. It monitors the content of the document
space on behalf of the user and, if information the user may be
interested in appears in the system, it sends an alert to him/her.

The upper layer of the presented functional structure of the system is
represented by a user interface. It integrates the functionality of
all the modules accessible to a particular user into one coherent
portal to the system and provides access to all functions of the
system in a uniform way.

3     Using domain model in Webocrat

3.1     Annotation

To give a system some kind of intelligence, it must know the meaning
of a document, i.e. its semantics. Standard HTML pages contain almost
unstructured information that is understandable only by humans, not by
computers. There is no way to tell the computer that an article is
about cars unless it contains the word car explicitly or semantic
analysis is applied. The solution is to annotate the document. This
means that explicit information about its meaning is attached to it,
whether manually or automatically. Thus, the system can extract
relevant information from every annotated document and use it in some
intelligent task such as searching. The Semantic Web initiative is
based on this method. It gives proposals and suggestions for
annotating HTML pages using special meta-tags and XML. These tags
carry implicit (tacit) information about the document, which is not
visible to the end user and is only used by the system. In knowledge
engineering this information is called meta-knowledge. There are many
ways to store meta-knowledge. It does not need to be in meta-tags
(this is not technically possible with MS Word documents); it can be
stored in special files or databases. Based on meta-knowledge one can
perform intelligent retrieval, which gives more relevant results than
pure full-text search.

Meta-knowledge can be of two types:

1. List of keywords or description in natural language. The document
   is enriched here with some kind of thesaurus. Full-text search is
   also performed on this part, giving more precise results.
2. Link to a concept in a predefined vocabulary. This method assumes
   that there exists some vocabulary of terms or concepts used in the
   area of our interest. More about this in the next section.

In our work, we concentrated on annotating an electronic document (in
our case any document published in the WCM system) by linking it,
together with other relevant documents, to relevant concepts from the
knowledge base (in our case an ontology). The approach is based on
grouping together relevant documents and concepts from the ontology.
We call such a group of documents and concepts an Association. Every
association has a name, a description, and some other attributes
needed later for document retrieval. The basic idea is shown in
Figure 2.

Figure 2 Basic idea of the associations (diagram labels: Document
space, Association, Knowledge base)

3.2     Domain model

The previous section mentioned the word vocabulary. In the simplest
case it is just a list of terms, where each term has its own
description, i.e. a thesaurus. Such a structure is not satisfactory
for our purposes, because it does not reflect relations among the
terms. What we want is a model of the real world or of a part of it.
The part of the world we are interested in is called a domain and its
model is called a domain model. A domain model is based on a
conceptualisation. A conceptualisation is an abstract, simplified view
of the world that we wish to represent for some purpose. It consists
of concepts that represent the objects of our interest in the real
world and the relationships that hold among them. To represent a
domain model formally we use an ontology. An ontology is an explicit
specification of a conceptualisation [1].

The domain model allows the system to perform reasoning and thus to
find the relevance of a document not only on a lexical but also on a
semantic basis. An example of a part of an ontology is shown in
Figure 3.
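
The association idea from Section 3.1, combined with the domain model
of Section 3.2, can be sketched in a few lines of Python. This is only
an illustrative sketch under stated assumptions, not the Webocrat
implementation: the class names (Concept, Association), the helper
functions (descendants, find_documents), and the sample concepts and
document identifiers are all hypothetical. It shows how a document can
be retrieved on a semantic basis, through a subclass relation in the
ontology, even though it is not linked to the queried concept
directly.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A concept of the domain model (hypothetical, minimal form)."""
    name: str
    description: str = ""


@dataclass
class Association:
    """A named group of documents and ontology concepts."""
    name: str
    description: str
    documents: set = field(default_factory=set)  # document identifiers
    concepts: set = field(default_factory=set)   # Concept objects


def descendants(concept, subclass_of):
    """Return the concept plus everything below it in the hierarchy.

    subclass_of maps a child concept to its parent concept.
    """
    found = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in subclass_of.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def find_documents(query_concept, associations, subclass_of):
    """Collect documents associated with the concept or its subclasses."""
    wanted = descendants(query_concept, subclass_of)
    hits = set()
    for assoc in associations:
        if assoc.concepts & wanted:
            hits |= assoc.documents
    return hits


# Sample data, invented for illustration only.
institution = Concept("Institution")
city_council = Concept("CityCouncil")
school = Concept("School")
subclass_of = {city_council: institution, school: institution}

associations = [
    Association("council-news", "News issued by the council",
                {"doc-1", "doc-2"}, {city_council}),
    Association("school-reports", "Annual school reports",
                {"doc-3"}, {school}),
]

print(sorted(find_documents(institution, associations, subclass_of)))
print(sorted(find_documents(school, associations, subclass_of)))
```

In this sketch a query for the concept Institution returns all three
sample documents, although none of them is linked to Institution
itself; a query for School returns only the school report.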
Figure 3 A part of sample ontology

4     Using domain model in Webocrat

The main idea behind the whole Webocrat system is to treat documents
of various types as being associated with a part of the domain model,
i.e. the ontology. This way it is possible to annotate discussions,
chats, reports, polls or ordinary WWW pages. By ordinary documents we
mean all the documents that are published by a local authority, such
as news, announcements, reports and other documents that could be
interesting for the public. When they are published, they are first
annotated, whether manually or semi-automatically. After that they are
ready for intelligent retrieval. When accessing information, a user
can formulate a query consisting of words for full-text search and of
terms (concepts) used in the ontology. Using concepts in the query
ensures that its hidden meaning is also discovered. Formulating such a
query also allows the user to define a personal profile of interest in
terms of the ontology. Personalised reports and newsletters can then
be generated automatically and sent to the user.

The scenario described assumes that the ontology covers all relevant
parts of real life concerning the structure of public institutions,
communal matters, ecology etc. Figure 3 shows a sample ontology about
institutions. (This is only a test example. Real-life ontologies are
being developed at the time of writing this paper.)

So far we have shown how classical web content can be annotated for
the aforementioned one-way communication. But knowledge about the
semantics of a document can also play an active role during
communication. Discussions are typical examples in Webocrat. We
consider a discussion to be a thread of documents that are all
annotated. To make it possible to retrieve discussion contributions
according to their content, links to elements of a domain model must
be created when a new discussion is created. These elements represent
the topics on which the discussion will focus. Each contribution added
to this discussion later is linked automatically to the same elements
of the domain model (contributions inherit links from their
discussion).

To make it possible to organise contributions within a discussion not
only according to the date and time of submission or the author, a
contribution can be supplemented with a set of links. These links can
be of two types: links to elements of a domain model and links to
other contributions within the discussion. The former type of link
makes it possible to define the content in more detail (not only in
the sense that the contribution is about exactly the same issues as
the discussion as a whole). This includes not only adding further
links to the set of links inherited from the discussion definition but
also reducing this inherited set. The latter type of link enables the
user to indicate which existing contribution(s) he/she is responding
to. In addition, a contribution to a discussion can be enriched with
links to documents inside or outside the system, e.g. when users
(submitters) refer to those documents in their contributions.

In order to read particular contributions it is necessary to access
them. The user has several ways to do this. First of all, he/she can
choose from a list of all available discussions. An alternative is to
use the links from contributions to elements of a domain model in
order to create groups of contributions dealing with the same set of
issues [2].

Using links to the ontology, the system can suggest a discussion on
some topic when the user reads a document on that topic. Or, when the
user contributes to a discussion, the system can advise where to find
more relevant information. This would be impossible without links to
the domain model. Moreover, when a user links a contribution to some
concepts, overriding the linkage of the whole discussion, the system
can automatically find a more relevant discussion, if one exists, and
suggest it. Similarly, if some contributions drift further and further
from the topic of the original thread, the administrator can be
notified to split the discussion. The similarity of contributions is
measured using the distances between the corresponding concepts in the
ontology.

With this discussion example we have shown how the domain model can
enhance communication and how classical tools can be used more
efficiently.

5     Domain model requirements

Drawing on experience from other projects and related work with
ontologies, we specified some basic attributes that we expect our
ontology to have. They were as follows:

- some constant types are defined, e.g. integer, float, string, date,
  currency
- basic objects are classes, instances, relations
- classes can be primitive (definition represents necessary but not
  sufficient conditions) or non-primitive (both sufficient and
  necessary)
- a class can be associated with a collection of slots
- slots with predefined semantics: documentation
- a collection of facets can be associated with a slot
- slot facets with predefined semantics (for classes only): value-type
  (can be constant type, constant expression (and, or, not),
  enumerated type); min-cardinality; max-cardinality; range (can be
  constant tuple or list of constant tuples); (not) same value as
  other slot has; subset-of-values as other slot has; documentation;
  default value; value
- an instance can inherit a collection of slots
- only one facet can be associated with a slot of an instance:
- value and default value of a slot can be a constant or a set of
  constants
- relations can be n-ary for n=1,2,3,...
- relations are defined on basic objects
- relations can have defined attributes: inverse-relation (which
  relation is the inverse of this one), disjoint, covered, equivalent,
  transitive, symmetric, functional
- predefined relations are:
    - instance-of: between a class and an instance; semantics:
      inheritance of slots (values, facets)
    - type-of: the inverse relation to instance-of
    - subclass-of: between two classes; semantics: inheritance of
      slots (values, facets)
    - superclass-of: the inverse relation to subclass-of
- slot facet values are inherited but can be overwritten (the new
  value must be more constraining than the old one)
- multiple inheritance (from more parents) is allowed
- special classes
    - THING: represents the root of the class hierarchy
        - every defined class is a subclass of THING
        - every instance is an instance of THING
        - has slot "documentation" with value-type STRING
    - CLASS: class of all classes
    - INSTANCES: class of all instances

In the current state of the project we needed to offer our partners a
tool for creating and editing ontologies. Because the Knowledge Module
task starts later in our project, we specified some further
requirements for the knowledge editor:

- it has to be flexible, to enable later modifications of the
  knowledge model
- platform independence
- it should enable importing ontologies from other formats

Thus we decided to use an Open Source knowledge editor programmed in
Java and to modify it for our purposes, instead of programming a new
one. The tool that best fitted almost all of our requirements seemed
to be Protégé-2000 from Stanford University. Other knowledge editors
we tested were OntoEdit, JOT, GEF, Apollo and SiLRI.

6     Using Protégé-2000 for creating ontologies

Protégé-2000 is the latest component-based and platform-independent
generation of the ontology editor. Two goals have driven the design
and development of Protégé-2000:

1. achieving interoperability with other knowledge-representation
   systems, and
2. being an easy-to-use and configurable knowledge-acquisition tool.

The first goal is achieved by compatibility of the knowledge model of
Protégé-2000 with OKBC (Open Knowledge Base Connectivity). As a
result, Protégé-2000 users can import ontologies from other
OKBC-compatible servers and export their ontologies to other OKBC
knowledge servers. Protégé-2000 uses the freedom allowed by the OKBC
specification to maintain the model of structured knowledge
acquisition tools and to achieve the second design goal of being a
usable and extensible tool.

Protégé fitted almost all of our requirements for the knowledge
editor. The only noticeable difference was in the way relations are
represented in Protégé. Because of the freedom of the ontology
specification in the Protégé knowledge model, relations are not
defined as basic objects [3]. We discuss later in this article how to
overcome this shortcoming. Other modifications we made to Protégé
were:

1. Localisation of Protégé into more languages (at this time it is
   localised into Slovak).
2. Adding the ability to view the class structure graphically
   (Figure 4). It helps the user to browse the ontology easily in a
   graphical view. The graph layout is computed automatically or can
   be changed by the user.

Figure 4 Graphview tab for Protégé-2000

7     Representing relations in Protégé

Because relations are not basic Protégé objects, we have to model
them. In the discussion within the Protégé community four possible
solutions were proposed:

Option 1
We can use own slots. This is probably the easiest way to go, but it
is also the most restrictive one. Here the relations are own slots on
all subclasses of the class that first specified those slots. The
values of the slots are classes that they are related to in one way or
another.
Advantages:
- Very easy to model.
- We already have all the interface and underlying structures in
  Protégé for this.
Problems:
- We cannot add additional information, such as orientation, in
  particular when the value of a slot is a list of classes and not a
  single class.

Option 2 (extension of Option 1)
Use facets on own slots (own slots on own slots) to specify
orientation and other additional properties.
Problem:
- Too complicated: it is hard even to explain exactly how things are
  going to work.

Option 3
Use template slots. Since slots are first-class objects in Protégé
(they are themselves frames), it is easy to express attributes of
relations such as reflexivity, transitivity, etc., as well as a
hierarchy of relations (the same is true for Option 1).
Advantages:
- Can use the advantages of inheritance more extensively.
- Own slots on classes are harder to explain and understand; template
  slots are easier.
Problem:
- It is harder to express additional constraints on relations, such as
  orientation.

Option 4
Relations are themselves classes. We can go one step further and reify
relations as classes themselves. Relations between particular classes
are instances of these Relation classes.
Advantages:
- Can easily encode meta-information on relations: Reflexive,
  Transitive, Inverse. All of these properties are own slots on a
  Relation class.
- Relations can have additional slots, such as orientation, that get
  instantiated when we define relations between classes.
The first advantage also carries over to most of the earlier options,
with the exception that the additional information (relation
attributes, hierarchy) would be on slots and not classes, which is
often harder to understand and manipulate.
Problem:
- Specialized browsing that "jumps over" a level to view hierarchies
  of entities based on each relation will be needed (for example, to
  view the part-of hierarchy).

All four options can be combined. The price for this is the loss of a
uniform approach to describing properties of relations such as
transitivity, inverses and so on.

Option 4 looks like the most suitable one, but it would be
uncomfortable for the user to define a special class for every
possible type of relation. Since real applications have not been
developed yet, we cannot predict the number of relations needed.

We decided on option 3. The EXTENDED_SLOT class has been defined with
the new facets TRANSITIVE and DISJOINT. Other attributes can easily be
added at any time. This EXTENDED_SLOT class is set to be the default,
so that every new slot that is created on any class is a subclass of
EXTENDED_SLOT and thus automatically contains the required attributes
TRANSITIVE and DISJOINT. A relation between two objects is modelled as
a slot, where one class of the relation contains that slot and the
second class is a value of that slot.

Protégé-2000 does not treat the DISJOINT or TRANSITIVE facets in any
special way. They are only used by the reasoning mechanism, which will
be developed later and will not be a part of Protégé itself.

8     Acknowledgements

This work is done within the Webocracy project “Web Technologies
Supporting Direct Participation in Democratic Processes”, which is
supported by European Commission DG INFSO under the IST program,
contract no. IST-1999-20364, and within the VEGA project 1/8131/01
“Knowledge Technologies for Information Acquisition and Retrieval” of
the Scientific Grant Agency of the Ministry of Education of the Slovak
Republic.

The content of this publication is the sole responsibility of the
authors, and in no way represents the view of the European Commission
or its services.

9     References

[1] Gruber, T. R. (1993): A translation approach to portable ontology
specifications. Knowledge Acquisition, 5(2):199-220.
[2] Mach, M.; Dridi, F.; Furdik, K. (2001): Webocrat System
Architecture and Functionality. Webocracy report 2.4.
[3] Noy, N. F.; Fergerson, R. W.; Musen, M. A. (2000): The knowledge
model of Protégé-2000: combining interoperability and flexibility.
International Conference on Knowledge Engineering and Knowledge
Management (EKAW '2000), Juan-les-Pins, France.
[4] Sabol, T.; Jackson, M.; Dridi, F.; Palola, I.; Novacek, E.;
Cizmarik, T.; Thompson, P. (2001): Dissemination and Use Plan.
Webocracy report 15.2.1.
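
As an illustration of the representation chosen in Section 7
(option 3), the following Python sketch models a relation as a slot
whose value is the related class, with the slot carrying the extra
attributes TRANSITIVE and DISJOINT. It is a sketch under stated
assumptions and does not use the Protégé API: the names ExtendedSlot,
OntClass, relate and related, as well as the sample classes and
relations, are hypothetical. The function related stands in for the
external reasoning mechanism, which the text says is still to be
developed; here it only shows one way the TRANSITIVE attribute could
be interpreted.

```python
from dataclasses import dataclass, field


@dataclass
class ExtendedSlot:
    """Hypothetical stand-in for a slot with the extra attributes
    TRANSITIVE and DISJOINT described in the text."""
    name: str
    transitive: bool = False
    disjoint: bool = False


@dataclass
class OntClass:
    """A class whose relations are stored as slot values (option 3)."""
    name: str
    # slot name -> list of related classes (the slot values)
    slot_values: dict = field(default_factory=dict)

    def relate(self, slot, other):
        """Model 'self --slot--> other' by adding other as a slot value."""
        self.slot_values.setdefault(slot.name, []).append(other)


def related(start, slot):
    """Names of classes reachable from start via slot.

    For a transitive slot the chain of slot values is followed;
    otherwise only the direct slot values are returned.
    """
    direct = list(start.slot_values.get(slot.name, []))
    if not slot.transitive:
        return [c.name for c in direct]
    seen, queue = [], direct
    while queue:
        current = queue.pop(0)
        if current.name in seen:
            continue  # already visited, also guards against cycles
        seen.append(current.name)
        queue.extend(current.slot_values.get(slot.name, []))
    return seen


# Sample relations, invented for illustration only.
part_of = ExtendedSlot("part-of", transitive=True)
managed_by = ExtendedSlot("managed-by")

department = OntClass("Department")
office = OntClass("Office")
city_hall = OntClass("CityHall")
mayor = OntClass("Mayor")

department.relate(part_of, office)
office.relate(part_of, city_hall)
department.relate(managed_by, office)
office.relate(managed_by, mayor)

print(related(department, part_of))
print(related(department, managed_by))
```

Because part-of is marked transitive in this sketch, Department is
reported as part of both Office and CityHall, while the
non-transitive managed-by relation yields only the direct value
Office.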