OKKAM: Towards a Solution to the "Identity Crisis" on the Semantic Web

Paolo Bouquet, Heiko Stoermer, Michele Mancioppi and Daniel Giacomuzzi
University of Trento, Dept. of Information and Communication Tech., Trento, Italy
Email: {bouquet, stoermer, manchioppi, giacomuzzi}@dit.unitn.it

Abstract— One of the pillars of the Semantic Web enterprise is the idea that if people use standard names for resources (URIs), then the integration of information from different distributed sources will happen smoothly and efficiently, simply by using URI identity as the key for merging RDF graphs into a single (virtual) graph. The question this paper tries to address is: how do we find and reuse an identifier for an entity on the (Semantic) Web? We propose a system called OKKAM that is currently under development; we discuss requirements, architecture, usage scenarios and the services we have developed so far to tackle this "Identity Crisis" on the Semantic Web.

I. INTRODUCTION

In the W3C recommendation Uniform Resource Identifier (URI): Generic Syntax [1], a resource is defined as "anything that has identity"1. This means that not only web-accessible pages and documents are resources, but also people, cities and conferences; even the concept of "car" and the property of "being the owner" of a car are resources, which can be referred to and described like any other resource (e.g. in an ontology).

Despite the generality of the Semantic Web approach, here we want to suggest that – in practice – there is an essential difference between managing and reusing identifiers of resources which correspond to "things" (in a very broad sense, ranging from electronic documents to bound books, from people to cars, from conferences to unicorns) – we will call them entities – and identifiers which correspond to abstract objects (like predicates, relations, assertions) – which we will call logical resources. Our thesis is the following: while any attempt at "forcing" the use of the same URIs for logical resources is in principle likely to fail (as every application context has its own peculiarities, and people tend to have different views even about the same domain2), the same does not hold – or holds at a level which is philosophically interesting but of little practical relevance – for entities. In other words, our claim is that there are compelling theoretical reasons why the Semantic Web (and any other semantically driven information system) should not force people to use shared URIs for logical resources, but only (or mostly) practical reasons why people do not use shared URIs for entities.

By analogy, our claim can be illustrated by the different difficulty of building white-page and yellow-page services. The former basically requires an efficient mechanism for listing entities, retrieving them, and distinguishing one entity from another; the latter always presupposes some taxonomy, which is typically either too general (and therefore does not help in discriminating services), too specific (and therefore heavy for users to master), or too complex (not usable).

It should be clear that the problem of unique identifiers for resources (in its two flavors: logical resources and entities) is crucial for achieving semantic interoperability and efficient knowledge integration. However, it is also evident that 99% of the research effort goes into (i) designing shared ontologies, or (ii) designing methods for aligning and integrating heterogeneous ontologies (with special focus on the T-Box part of the ontology). Perhaps because of its "practical" flavor, we must recognize that only a very limited effort has been devoted to the issue of identity management for entities. For example, ontology editors such as Protégé support the "bad practice" of creating a new URI for every new instance created in an ontology.

In our opinion, this problem is not only of general interest for the Semantic Web enterprise, but is one of the most critical gaps in an ideal pipeline from data to semantic representation: if we do not have a reliable (and incremental) method for supporting the reuse of URIs for the new entities that are annotated in new documents (or any other data source), we risk producing an archipelago of "semantic islands" where conceptual knowledge may (or may not) be integrated (depending on how we choose the names of classes and properties, and on the availability of cross-ontology mappings), but ground knowledge is completely disconnected. And since the most valuable knowledge is typically about individuals, we take this to be an issue that should be attacked.

1 'A resource can be anything that has identity. Familiar examples include an electronic document, an image, a source of information with a consistent purpose (e.g., "today's weather report for Los Angeles"), a service (e.g., an HTTP-to-SMS gateway), and a collection of other resources. A resource is not necessarily accessible via the Internet; e.g., human beings, corporations, and bound books in a library can also be resources. Likewise, abstract concepts can be resources, such as the operators and operands of a mathematical equation, the types of a relationship (e.g., "parent" or "employee"), or numeric values (e.g., zero, one, and infinity)' [1].
2 This is what in [2], which was co-authored by one of the authors of this paper, was called the distributed knowledge argument.
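The merging behavior that motivates this paper can be made concrete with a minimal sketch. It is not part of OKKAM itself, and the URIs and triples are invented for illustration: RDF merging is essentially set union, so two sources integrate their knowledge about a person only when they use the same URI; a source that mints its own URI for the same person stays a disconnected "semantic island".

```python
# Minimal sketch of URI-based RDF graph merging (hypothetical URIs).
# A triple is a (subject, predicate, object) tuple; a "graph" is a set of them.

def merge(*graphs):
    """Merging RDF graphs is set union: URI identity is the join key."""
    out = set()
    for g in graphs:
        out |= g
    return out

def facts_about(graph, subject):
    """All knowledge about one entity, collected across sources."""
    return {(p, o) for (s, p, o) in graph if s == subject}

# Two sources that agree on the URI for the same person:
source_a = {("ex:HeikoStoermer", "ex:affiliation", "ex:UniTrento")}
source_b = {("ex:HeikoStoermer", "ex:authorOf", "ex:OkkamPaper")}

# A third source that minted its own URI for the very same person:
source_c = {("other:H_Stoermer", "ex:memberOf", "ex:DIT")}

merged = merge(source_a, source_b, source_c)

# Knowledge integrates for the shared URI (two facts)...
print(facts_about(merged, "ex:HeikoStoermer"))
# ...but the third source remains an isolated island (one fact).
print(facts_about(merged, "other:H_Stoermer"))
```

Nothing in the merge operation itself can detect that `ex:HeikoStoermer` and `other:H_Stoermer` denote the same individual; that is exactly the gap an entity-naming service is meant to fill.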
In this paper we introduce the main requirements and a prototype implementation of OKKAM3, a service for supporting transparent integration of knowledge about entities through simple identity management support. OKKAM can be described at two different levels:
• the basic services – which belong to a module called OKKAMCORE – provide APIs to create and store URIs for entities, to add/modify/remove informal descriptions of each entity, and to index the resources in which knowledge about an entity is provided (e.g. ontologies, web pages);
• on top of OKKAMCORE, OKKAM offers a collection of advanced services, including searching for already existing entities (using different search criteria), extracting information about entities, ranking results, supporting the reuse of URIs for entities in ontology editing, and so on.

The structure of the paper is the following: in Sect. II, we introduce our motivations and the resulting goals for our project in more detail. After that, in Sect. III, the OKKAM system architecture and design approach are illustrated. Sect. IV describes the first basic services that have been implemented on top of OKKAMCORE, whereas Sect. V illustrates two usage scenarios we have addressed with the system. We conclude with a discussion of open issues and an outlook on the further development of OKKAM in Sect. VI.

II. GOALS AND MOTIVATIONS

As soon as one starts thinking about the idea of an entity repository, the temptation to build what Craig Knoblock4 called an EntityBase in one of his recent talks is very strong. In short, an EntityBase can be thought of as an entity-centric knowledge base, where knowledge is organized around entities instead of schemas (e.g. relational schemas or even ontologies). In such an approach, any entity type would be characterized by a collection of attributes (for example, for entities of type book, attributes such as "author", "title", "date of publication", "publisher"), whose semantics is known in advance and explicitly specified.

We called this a temptation because it is extremely appealing (we would always know what we know about an entity), but also very dangerous, as it presupposes a commitment on the meaning of each attribute which cannot be guaranteed in most practical situations by a repository that aims at being open, extensible and global. Therefore, an important requirement for our service is that it is light and fast, and that it is not confused with yet another attempt in the direction of CYC [3] or SUMO [4]; systems of this type offered useful approaches in certain areas, but have obviously not contributed to a solution of the identity problem in the Semantic Web. What we aim to provide is a naming service for entities and a directory containing entity profiles, not a knowledge base.

An Entity Profile stores untyped data about entities. This data supports the human user, or an application using the OKKAM API, in processing descriptions of entities, and in effect enables them to assess whether the entity they want to store knowledge about in their own local KB already has a URI in OKKAM, or whether they have to create a new one. We store untyped data because typing an entity's attributes would require us to classify the entity, which would conflict with the abovementioned goal. We do not discriminate between types of entities, because we explicitly want to be able to provide naming and descriptions for any entity.

Of course, at first sight one could think about what types of entities would be described in OKKAM, such as persons, artifacts, locations, companies etc.; this could make it appear sensible to provide a basic set of typed attributes for these entities. But we envision the system also providing support for less obvious applications such as Named Entity Recognition from the field of Natural Language Processing, which we discuss later in Section V. In these applications an entity might represent a location or a piece of text in a document, a document itself, or a collection of documents, and we end up with an unlimited set of potential types of entries, which makes it impossible to provide a common set of typed attributes. Therefore it is our opinion that only untyped descriptive metadata can provide the envisioned level of generality.

Fig. 1. Schematic overview of OKKAM, plus external K/I sources

In addition to these untyped data, OKKAM provides for the management of what we call ontology references in the Entity Profile, i.e. a set of URIs of external sources that are known to store information or knowledge about these entities, as illustrated in Fig. 1. One of the reasons to go in this direction was the motivation to make OKKAM provide a possible solution to integration issues in the Semantic Web. While a great amount of work has been performed on schema-level information integration5, entity-level information and knowledge integration still offers many opportunities for interesting approaches. One possible application we envision supporting with OKKAM is an extension-based equivalence check for classes in an alignment or integration process. Currently, in a schema-level integration process without an extension check, classes can be estimated to be equivalent, but the result of this estimation cannot be proved. Additionally, without a service that provides strong decision support about whether two individuals with the same name are actually identical (which is the current situation in the Semantic Web), an extension check will hardly deliver reliable results. With the help of Entity Profiles in OKKAM we hope to improve this situation: if two assumedly equivalent classes turn out to have identical sets of OKKAM-registered individuals associated with them, we have very strong reason to support the equivalence assumption.

The last component of the Entity Profile is a set of assertions of identity between entities. We provide these for the case where two entities with different URIs in OKKAM are later discovered to describe the exact same object, and are thus identical. One possible criticism at this point is certainly the question of how we can know and be certain about the identity of entities. The answer is that we cannot. OKKAM will suffer from the same garbage-in-garbage-out property as any other information system. But with OKKAM we can at least provide a means for the Semantic Web to store and represent such information, and we hope that by consistent use of OKKAM in Semantic Web applications we can strongly improve the current situation by enabling agents to gain a certain level of confidence that they are actually "talking about" the same objects.

III. OKKAMCORE: CHARACTERIZING ENTITIES

A. Data Model and API

OKKAMCORE provides its users with functionalities to manage and retrieve entities and assertions of identity. From a programmatic point of view, two APIs are provided:
• Publication API: enables publishing, modifying and removing entities and assertions of identity;
• Inquire API: allows retrieval of entities matching a given set of criteria.
Both APIs offer the straightforward functionalities one would expect in such a system; for the sake of brevity we do not describe the full API in this article.

Fig. 2. OKKAMCORE's Entity Relationship Diagram

The OKKAMCORE application manages data describing entities and assertions of identity between entities. The data structures we use to model this are shown in the Entity Relationship Diagram presented in Figure 2. Each Entity is in a biunivocal (one-to-one) relation with an Entity Identifier, which is created by the system and represents the URI of the entity in OKKAM, with which users can identify entities within their application or KB. Each Entity may have a Preferred Identifier, provided by the user who creates the entity, e.g. to mirror an identifier used in their information system; the relation between Entities and Preferred Identifiers is labeled as a key because Entities cannot share Preferred Identifiers. Each Entity may have any number of Alternative Identifiers; similarly to Preferred Identifiers, every Alternative Identifier can belong to only one Entity at a time. To keep the diagram simple, it does not show that there can be no overlap among the different types of identifiers. All identifiers are names for the Entities on which they are set: all identifiers set on a given Entity are therefore synonyms. Since different Entities have different names, each identifier can appear on at most one Entity, no matter whether it acts as an Entity Identifier, a Preferred Identifier or an Alternative Identifier.

Entities can have any number of Labels set on them. Each Label has a prefix and a value; a Label's prefix may be left empty. Two distinct Labels with the same prefix and value must belong to different Entities: as illustrated in Figure 2, the triple consisting of the Label's prefix, the Label's value and the Entity on which the Label is set forms a key. Each Ontology Reference has a value, and any number of Ontology References can be set on an Entity; similarly to Labels, two Ontology References with the same value can only belong to different Entities.

Assertions of Identity are uniquely identified by their Assertion of Identity Identifiers. Each Assertion of Identity involves exactly two Entities, and different Assertions of Identity cannot involve the same two Entities.

B. Architecture

The currently available implementation of OKKAMCORE is built on top of the J2EE 1.4 platform. The OKKAMCORE application is an Enterprise Application exposing the Publication and Inquire APIs as Web Services. Its architecture, as presented in Figure 3, follows the classical three-tier model that subdivides the Presentation Logic (the Web Services), the Business Logic (carried out by an EJB module), and the Data Persistence logic. The Web Services framework we adopted is Apache Axis26.

3 The system is named after Occam, a medieval philosopher whose main principle – known as "Occam's razor" – was: entities should not be multiplied beyond necessity (in Latin: entia non sunt multiplicanda praeter necessitatem).
4 Craig Knoblock's homepage: http://www.isi.edu/~knoblock/. Unfortunately, at this point no citeable publications about this topic are available.
5 It is hardly possible to cite all related work in this field. Specific to the area of the Semantic Web, the reader is referred e.g. to the publication list on http://www.ontologymatching.org for a host of publications, or to the Ontology Alignment Evaluation Initiative (http://oaei.ontologymatching.org/) which performs an alignment contest. For an overview more related to the database world, we refer e.g. to [6].
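The key constraints of this data model can be summarized in a small in-memory sketch. This is purely illustrative and not the actual J2EE implementation: the class and method names are invented, but the invariants mirror the ones described above — every identifier names at most one Entity, and an assertion of identity may link a given pair of entities only once.

```python
# Hypothetical in-memory sketch of the OKKAMCORE constraints described above
# (names invented; the real system exposes these as Web Service APIs).

class OkkamCoreSketch:
    def __init__(self):
        self.identifier_owner = {}   # identifier -> entity URI (any identifier kind)
        self.labels = {}             # entity URI -> set of (prefix, value) labels
        self.identity_pairs = set()  # frozensets of two entity URIs

    def publish_entity(self, uri, labels=(), alt_identifiers=()):
        """Publication API sketch: register an entity and its identifiers."""
        for ident in (uri, *alt_identifiers):
            # Each identifier may name at most one Entity.
            if self.identifier_owner.get(ident, uri) != uri:
                raise ValueError(f"identifier {ident!r} already names another entity")
            self.identifier_owner[ident] = uri
        self.labels.setdefault(uri, set()).update(labels)

    def assert_identity(self, uri_a, uri_b):
        """Different Assertions of Identity cannot involve the same two Entities."""
        pair = frozenset((uri_a, uri_b))
        if pair in self.identity_pairs:
            raise ValueError("these two entities are already asserted identical")
        self.identity_pairs.add(pair)

    def inquire(self, label_value):
        """Inquire API sketch: entities carrying a label with this value."""
        return {uri for uri, ls in self.labels.items()
                if any(v == label_value for _, v in ls)}

core = OkkamCoreSketch()
core.publish_entity("okkam:e1", labels={("", "Heiko Stoermer")})
core.publish_entity("okkam:e2", labels={("", "H. Stoermer")})
core.assert_identity("okkam:e1", "okkam:e2")
print(core.inquire("Heiko Stoermer"))  # {'okkam:e1'}
```

Attempting to publish a new entity that reuses `okkam:e1` as an alternative identifier, or asserting the same identity pair twice, raises an error — the behavior the key constraints in Figure 2 are meant to enforce.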
An Entity is in 6 Apache Axis 2 Home: http://ws.apache.org/axis2/ data is allowed to be stored. In particular, we want to stress the following: • first of all, we want to add a new entity only if it is not already stored in O KKA MC ORE. But this means that we need smart ways for recognizing if a new candidate entity is already stored, and for deriving when a new entity which looks like an entity already stored actually is a new one. These two requirements are crucial: failing to meet the first would lead to a lack of completeness (failing to support inferences which in theory are sound based on the fact that two names refer to the same entity); failing to meet the second would lead to a lack of correctness (false conclusions would be supported, based on the fact that two different entities have been collapsed onto a single identifier); • imagine we detect that an entity is already stored, and that we find a new occurence of that entity in a document Fig. 3. O KKA MC ORE Application Architecture where some information about it is provided. Question: what if the new information conflicts with the old one? And, even before, how do we detect that there is such an different modules: i) handling the configurations of the appli- inconsistency? cation, ii) marshalling the internal representation of the data • as it will be clarified in the section on envisaged ap- to external form, such as XML documents and iii) commu- plication scenarios, information may be imported in nication with the database. Thus, implementations exploiting O KKA MC ORE from very different sources, including hu- different database technologies may be plugged effortlessly. mans (who may be carefully making data entry), ad-hoc For this reason, during the development of the O KKA MC ORE wrappers designed to import entities from rich sources application, we implemented two different Data Persistence (e.g. 
lists of entities from Wikipedia), entity recognition modules: one based on a native XML database, and another tools (which may be extracting entity descriptions from one built on top of a relational database. free text). These potential sources may provide very At first we developed the XML Database based backend be- uneven data, including a lot of garbage, which would cause it allowed us to have a first prototype of O KKA MC ORE undermine the role of O KKA M as a general and reliable running in a very short time. The backend is based on the tool; Open Source database eXist7 . Although the flexibility of the • finally, a theoretical issue which needs to be addressed XML native database together with the XQuery expressiveness is the following: what does count as an entity? There enabled us to complete the backend relatively quickly, we is little doubt that people, organizations, cars, computer experienced scalability issues. It turned out that the number of files, electronic devices, are entities. But, for example, entities that can be managed by this backend ranges in the tens is a document an entity? Is it an abstract entity, or it is of thousands, which is way below our desired goal. Although identified with its physical realizations? If so, is every the XML database based backend performed well for testing copy of a document a different entity? Another example of the rest of the application, and made the O KKA MC ORE is: are logical resources (like concepts, relations, topics) application promptly available for tests with the other services entities? Or the entity is the linguistic expression used built on top of it, we decided to abandon this approach in favor to express a concept? But then are two linguistic formu- of a relational database based backend. lations of the same concept different entities? And fur- IV. S ERVICES thermore: are fictitious entities entities? 
Should we allow O KKA M can be viewed as a collection of services built on Pegasus and Spider Man to sneak into O KKA MC ORE? top of O KKA MC ORE. In this section we list the main services And the list can be made much longer8 . . . which – in our opinion – should belong to O KKA M, describe To address these issues, we have developed the following their implementation (if available), or present ideas on how compontents: they could be implemented. • OkkamListsManager On the WWW there are many lists of entities and are thus A. Population of O KKA MC ORE 8 We notice that a very practical version of these philosophical questions is The first important service is the one that supports the the following: what should be represented as an instance in an OWL ontology? population of O KKA MC ORE with new entities. In fact, there And what as a class/property? The issue is tricky, and we make only one are many issues that must be addressed and solved before new example: should “Pizza Margherita” be a class or aan instance of an Italian food ontology? If we check e.g. [5], we find that the answer to this type of 7 eXist Home: http://exist.sourceforge.net/ questions can be quite disappointing. a potentially important resource for O KKA M. For exam- is the most precise and complete because the user can ple Wikipedia provides lists of countries, cities, members provide information that the system can not automatically of particulars domains (e.g. Presidents of the United discover, and optimize the input in a feedback loop. States, Computer Scientists, etc) that are exactly the types • Protege Plugin of entities that we want to store in the system. With the We provide a plugin for the ontology editor Protege which objective to find a standard mechanism for integrating we describe in further detail in Sect. V-A. these entities into Okkam we developed a language (an XML Schema) that describes the input that a data source B. 
Searching for URIs has to follow to communicate with the population process Another critical service is searching for identifier of some of O KKA M. The main elements of the schema follows entity which is known either by some description (e.g. name the internal structure of Okkam, in fact we have elements for people), or by an identifier which was not issued by like ”Labels”, ”Label-prefix” or ”Label-value” that are O KKA M. This site should be held very simple, like a tra- easy to map with the tables elements of OkkamCore. ditional search engine, and based on an easy mechanism to This language is used by different wrappers that we visualize results. In a standard use case the user types in a developed and that try to convert the structure of a source keyword associated to an entity and the system searches the list into the O KKA M input standard. For lists from the repository for instances that match this label. For example Web (Wikipedia, Yahoo, Google, etc.) the main purpose if a user searches for entities that have the label ”Heiko of the wrappers is the data cleansing process from HTML Stoermer” the O KKA M Management System will search in the tags. After this step the entity collections is normalized database the instances that have this keyword and will return with the objective to delete duplicates. Entities with the the main information about these. The main data will be the same annotation label are recognized by the system and URI of the entity, the other labels associated to the entity, the O KKA M administrative user can check if there exist in our example can be ”H.Stoermer”, ”Stoermer”, Mr. Heiko conflicts from members of the list that are the same entity Stoermer” and the classes of the ontologies where the entity (from a logical point of view). During insertion, for ach is used. In our example we have different classes as ”Person” entity the system searches O KKA M if there is already an or ”PHD Student”. 
The information about the classes where entity with the same label/s. If yes, this entity is “frozen” other person use the entity, its URI, are very important because and included in a set of entities that should be checked by with this data the user can chose which entity is the correct the administrator before addition to the system, otherwise URI in the O KKA M. it is added immediately. If we have two URI’s that share the same label ”Roma”, but • OkkamDBManager one is attached to class like ”City” or ”Capital” and other refers Another important information source for O KKA M can to ”Person” or ”Customer” or ”Employee”, the user can easyily be generic databases, as far as we have access to them. understand whether he needs the first URI because he wants Examples might include direct database access to in- to speak about the capital of Italy or the second one because formation systems such as extranets, online shops or he refers to a Person named ”Roma”. The filtering process, publishing houses. In this case the transformation from with information about classes, can be performed before the the internal structure of the tables into the O KKA M input search step: the submitted query can be a pair of ”keyword - language is easier because the main objective of the class”. This means that the system will return only the URI process is writing queries that build the link between the that fulfil both the terms of the interrogation. Web O KKA M database structure and the okkam data structure. When This first use case is very simple and understandable because the transformation into the input language is completed, the most difficult process, the filtering task, is delegated to the the rows that come from the database follow the same user responsible for this operation. process that we already describe with web lists. With The Web site of the O KKA M is not the only application database sources the role of the user becomes more built on the URI database. 
There are many situations where it important because, with high numbers of entities, dupli- is very difficult to believe that users use the web site to search cation and redundancy are an increasing problem. the URI of the resources that they need. For example, if we • OkkamManualEntry have a large database with all employees of an organization is Another solution we provide to insert new entities is impossible that the designers and developers wanting to build the manual case. A Web interface provides easy access semantic application on this data search in the O KKA M web to the insert function. The user can add new entities, site all the URI’s of the persons stored in their database. This with labels, ontology references, etc., to the system using process can be simplified if they can use an automatic service, a form to specify all the information that he/she want in this case a web service, that provide an access point to describe the new entity with. As in the previous case, the O KKA M that an application can use. The developers can if the system finds a possible conflict with entities that build an application that extract the data from their database are already in Okkam, it issues a warning message that and send them to the web service which will return some informs the user of the possible error. This methodology results, URI, about the information that already are stored in of insertion is the slowest that Okkam provides, but it the O KKA M. V. T WO USAGE SCENARIOS Profile is created in O KKA M and the resulting URI is A. Runtime support for ontology editing used accordingly. For subsequent discoveries of the same named entitiy, the same URI will be used to indicate Another important area for which O KKA M has to provide that the two discovered entities are in fact the same, just services and applications are existing Semantic Web tools. In in different locations of the document. 
In particular, ontology editors are applications where users build a formalization of part of the world by means of classes and instances of these classes, all identified by URIs. One of the most widely used and important editors is Protégé, an open source product that can be extended and modified with "plug-ins" added on the core system. For the OKKAM vision it is of high importance to develop a plug-in for this application which provides a connection with the URI database when users create new instances of a class, which we are doing as illustrated in Fig. 4. If a user creates a new instance of a class, instead of assigning an arbitrary, meaningless number as ID, the plug-in will search the repository to check whether a URI already exists that can be assigned to this new instance. The selection process is envisioned to be similar to the web search use case, where a list of URIs that match the label of the new instance is visualized to the user.

Important support for all the selection processes comes from additional tools, such as WordNet, that provide information about the meaning of the classes used in the ontologies where the new instances are created. With this information the system has more data with which to recognize the correct URI to return to the users or applications that query OKKAM.

B. Supporting Knowledge Extraction and Representation

One of the scenarios we are currently implementing with the help of OKKAM is to support Knowledge Extraction (KE) processes and the resulting Knowledge Representation (KR) in a Semantic Web project9 that aims at building a large-scale Knowledge Base (KB) from information stored in distributed document bases. The architecture comprises a pipeline of processes that covers all steps from KE to the building of the KB (the so-called Semantic Resource Network) for end-user services, as illustrated in Fig. 5.

Within the pipeline there are several points of application imaginable, two of which we have currently implemented and are further described here:

• Information Extraction: Named Entity Recognition and Coreference
Whenever our NLP process recognizes a named entity in a piece of text, it interacts with OKKAM to check whether this named entity already has a unique URI. If yes, the NLP process stores locally10 the fact that a uniquely identified entity has been discovered, together with additional information such as its location in the document, etc. If the entity does not have a URI yet, an Entity Profile is created in OKKAM. This approach is equally applicable to discovered coreferences11. If the NLP process updates the Entity Profiles in OKKAM correctly, we gain direct access to search situations of the type "show me all documents that talk about this entity", as the respective links would be stored as Ontology References which we can evaluate and reason about with a higher-level service.

• Refinement: Identity Discovery
In the refinement phase, as depicted in Fig. 5, we can address shortcomings of the NLP processes in terms of discovery of identity. The VIKEF pipeline has dedicated a whole processing step to this issue, as – at the named entity extraction level – it is not always possible to detect identity between entities. Obvious examples in this case are missing correspondences between orthographic variations, hinted at already in Sect. IV-B, e.g. the fact that within one document there is a certain probability that the strings "Stoermer", "H. Stoermer" and "Heiko Stoermer" denote the exact same individual. With the support of the OKKAM system, we have implemented several heuristics to address this issue, the simplest of which performs a substring query to OKKAM and uses a string similarity measure on the results to choose candidates for establishing an assertion of identity between them, and thus to cluster annotations. A higher-level process is free either to choose one single URI for all the annotated entities or to retain the original URIs, as it is always possible to perform clustering via analysis of identity assertions in OKKAM.

9 See http://www.vikef.net for further information about the VIKEF project.
10 In fact, the annotations created in this phase are stored in an XML file, which is later refined and then used as a base for the generation of RDF annotations that will be fed into a large knowledge base.
11 A coreference is a linguistic pattern typically involving pronouns when talking about an object that has previously been named. Example: "Peter is a good runner. He does 10k in 45 minutes." The personal pronoun he establishes the coreference in this case.

VI. DISCUSSION AND CONCLUSION

OKKAM is a typical example of an application which is not based on some radically new scientific result, but aims at filling a gap by using existing technologies in a new way. In our opinion, without OKKAM (or a similar service), most Semantic Web promises will never be kept, as it provides a sort of bottom level for integration which cannot be achieved ex post when the ball stops. However, the fact that the basic technologies are already available should not lead us to underestimate the critical factors which may affect the success and adoption of OKKAM. In addition to aspects already discussed throughout the paper, we identify acceptance issues, in the sense that not every party involved in the Semantic Web may be willing to use a centrally managed service that is outside of their control. Privacy issues include all the well-known aspects of data security, access management, privacy etc. that almost all public information systems share. Last, but not least, there are of course questions of offered features and functionality, such as a really efficient and intelligent search and ranking mechanism for Entity Profiles in OKKAM, as well as performance and scalability issues which are again common to most information systems.
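The simplest identity-discovery heuristic described above – a substring query followed by a string-similarity ranking of the results – can be sketched as follows. This is a minimal, self-contained approximation, not the actual implementation: the OKKAM substring query is simulated over a local list of entity labels, `difflib.SequenceMatcher` stands in for the (unspecified) similarity measure, and the repository contents and the 0.5 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Normalized similarity in [0, 1]; a stand-in for the similarity
    # measure used by the real heuristics, which the paper does not name.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def substring_query(repository, fragment):
    # Simulates the OKKAM substring query: all stored labels that
    # contain the fragment (case-insensitive).
    f = fragment.lower()
    return [label for label in repository if f in label.lower()]

def identity_candidates(repository, mention, threshold=0.5):
    # Candidates for an identity assertion with `mention`: substring
    # matches ranked by decreasing similarity, cut off at `threshold`.
    scored = [(label, similarity(mention, label))
              for label in substring_query(repository, mention)]
    return sorted((s for s in scored if s[1] >= threshold),
                  key=lambda s: -s[1])

# Hypothetical repository of entity labels, for illustration only.
repo = ["Heiko Stoermer", "H. Stoermer", "Paolo Bouquet", "Michele Mancioppi"]
for label, score in identity_candidates(repo, "Stoermer"):
    print(f"{label}: {score:.2f}")
```

A higher-level process could then either merge the matched annotations under a single URI or record pairwise identity assertions, as described above.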
Fig. 4. A Protégé plugin for generating individuals registered in OKKAM.

Fig. 5. Knowledge pipeline to be supported by OKKAM.

Our planned next steps are to address exactly these issues in the form of further research and by developing additional services on top of OKKAMCORE.

We conclude with the statement that currently, when creating ontologies, people actually perform two different tasks: they specify a conceptualization, and then "populate" such a conceptualization with instances by assigning instances to some class and specifying the values of their properties (if any). It is a trivial observation that the same domain (set of entities) may be used to populate different ontologies (e.g. we may have two different conceptualizations of Italian wines & food), and that any two ontologies (e.g. an ontology about Semantic Web researchers and another about people living in Italy) may have overlapping domains. Creating a conceptual schema and then populating it with instances address two different issues: the first is an epistemological issue (it has to do with knowledge about the world), the second is an ontological issue (it has to do with existence).

From a design perspective, what we propose is to keep these two tasks separated: on the one hand, we need a universal and non-ambiguous way to refer to the entities about which an agent may have some knowledge; on the other hand, we need a way to specify knowledge about these entities. We believe that with the help of OKKAM this goal can be achieved more cleanly for the Semantic Web, as to the existing methods of specifying knowledge in the form of ontologies and knowledge bases we add an identity and reference architecture with a central character that enables systems and agents to ensure that they "talk" and store knowledge about the same entities, if these objects share the same identifier.

VII. ACKNOWLEDGMENTS

This research was partially funded by the European Commission under the 6th Framework Programme IST Integrated Project VIKEF – Virtual Information and Knowledge Environment Framework (Contract no. 507173, Priority 2.3.1.7 Semantic-based Knowledge Systems; more information at http://www.vikef.net).

REFERENCES

[1] T. Berners-Lee, R. Fielding, and L. Masinter. RFC 3986: Uniform Resource Identifier (URI): Generic Syntax. IETF (Internet Engineering Task Force), 2005. http://www.ietf.org/rfc/rfc3986.txt.
[2] M. Bonifacio, P. Bouquet, and P. Traverso. Enabling distributed knowledge management: Managerial and technological implications. Informatik – Zeitschrift der schweizerischen Informatikorganisationen, 1:23–29, 2002.
[3] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Commun. ACM, 38(11):32–38, 1995.
[4] I. Niles and A. Pease. Towards a standard upper ontology. In FOIS, pages 2–9, 2001.
[5] N. F. Noy and D. L. McGuinness. Ontology Development 101: A Guide to Creating Your First Ontology. Stanford University. http://protege.stanford.edu/publications/ontology_development/ontology101.html.
[6] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 10(4):334–350, 2001.