=Paper=
{{Paper
|id=Vol-201/paper-11
|storemode=property
|title=OkkaM: Towards a Solution to the ``Identity Crisis'' on the Semantic Web
|pdfUrl=https://ceur-ws.org/Vol-201/33.pdf
|volume=Vol-201
|dblpUrl=https://dblp.org/rec/conf/swap/BouquetSMG06
}}
==OkkaM: Towards a Solution to the ``Identity Crisis'' on the Semantic Web==
OKKAM: Towards a Solution to the “Identity Crisis” on the Semantic Web

Paolo Bouquet, Heiko Stoermer, Michele Mancioppi and Daniel Giacomuzzi
University of Trento
Dept. of Information and Communication Tech.
Trento, Italy
Email: {bouquet, stoermer, manchioppi, giacomuzzi}@dit.unitn.it
Abstract— One of the pillars of the Semantic Web enterprise is the idea that if people use standard names for resources (URIs), then the integration of information from different distributed sources will happen smoothly and efficiently simply by using URI identity as a key for merging RDF graphs into a single (virtual) graph. The question this paper will try to address is: how do we find and reuse an identifier for an entity on the (Semantic) Web? In this paper we propose a system called OKKAM that is currently under development; we discuss requirements, architecture, usage scenarios and services we have developed so far to tackle this “Identity Crisis” on the Semantic Web.

I. INTRODUCTION

In the W3C recommendation Uniform Resource Identifier (URI): Generic Syntax [1], a resource is defined as “anything that has identity”1. This means that not only web accessible pages and documents are resources, but also people, cities and conferences; even the concept of “car” and the property of “being the owner” of a car are resources, which can be referred to and described as any other resource (e.g. in an ontology).

Despite the generality of the Semantic Web approach, here we want to suggest that – in practice – there is an essential difference between managing and reusing identifiers of resources which correspond to “things” (in a very broad sense, ranging from electronic documents to bound books, from people to cars, from conferences to unicorns) – we will call them entities – and identifiers which correspond to abstract objects (like predicates, relations, assertions) – which we will call logical resources. Our thesis is the following: while any attempt at “forcing” the use of the same URIs for logical resources is in principle likely to fail (as every application context has its own peculiarities, and people tend to have different views even about the same domain2), the same does not hold – or holds at a level which is philosophically interesting but of little practical relevance – for entities. In other words, the claim is that there are compelling theoretical reasons why the Semantic Web (and any other semantically driven information system) should not force people to use shared URIs for logical resources, but only (or mostly) practical reasons why people do not use shared URIs for entities.

By analogy, our claim can be illustrated by considering the different difficulty of building white page and yellow page services. The former basically requires an efficient mechanism for listing entities, retrieving them, and distinguishing entities one from another; the latter always presupposes some taxonomy, which is typically either too general (and therefore does not help in discriminating services), or too specific (and therefore heavy to master for users), or too complex (not usable).

It should be clear that the problem of unique identifiers for resources (in its two flavors: logical resources and entities) is crucial for achieving semantic interoperability and efficient knowledge integration. However, it is also evident that 99% of the research effort is on the problem of (i) designing shared ontologies, or (ii) designing methods for aligning and integrating heterogeneous ontologies (with special focus on the T-Box part of the ontology). Perhaps because of its “practical” flavor, we must recognize that only a very limited effort has been devoted to addressing the issue of identity management for entities. For example, ontology editors, such as Protégé, support the “bad practice” of creating new URIs for any new instance created in an ontology. In our opinion, this problem is not only of general interest for the Semantic Web enterprise, but is one of the most critical gaps in an ideal pipeline from data to semantic representation: if we do not have a reliable (and incremental) method for supporting the reuse of URIs for the new entities that are annotated in new documents (or any other data source), we risk producing an archipelago of “semantic islands” where conceptual knowledge may (or may not) be integrated (it depends on how we choose the names of classes and properties, and on the availability of cross-ontology mappings), but ground knowledge is completely disconnected. And since the most valuable knowledge is typically about individuals, we take this to be an issue that should be attacked.

1 ‘A resource can be anything that has identity. Familiar examples include an electronic document, an image, a source of information with a consistent purpose (e.g., “today’s weather report for Los Angeles”), a service (e.g., an HTTP-to-SMS gateway), and a collection of other resources. A resource is not necessarily accessible via the Internet; e.g., human beings, corporations, and bound books in a library can also be resources. Likewise, abstract concepts can be resources, such as the operators and operands of a mathematical equation, the types of a relationship (e.g., “parent” or “employee”), or numeric values (e.g., zero, one, and infinity)’ [1].

2 This is what in [2], which was co-authored by one of the authors of this paper, was called the distributed knowledge argument.

In this paper, we introduce the main requirements and a
prototype implementation of OKKAM3, a service for supporting transparent integration of knowledge about entities through simple identity management support. OKKAM can be described at two different levels:

• the basic services – which belong to a module called OKKAMCORE – provide APIs to create and store URIs for entities, to add/modify/remove informal descriptions of each entity, and to index the resources in which knowledge about an entity is provided (e.g. ontologies, web pages);
• on top of OKKAMCORE, OKKAM offers a collection of advanced services, including searching for already existing entities (using different search criteria), extracting information about entities, ranking results, supporting the reuse of URIs for entities in ontology editing, and so on.

The structure of the paper is the following: in Sect. II, we introduce our motivations and the resulting goals for our project in more detail. After that, in Sect. III the OKKAM system architecture and design approaches are illustrated. Sect. IV describes the first basic services that have been implemented on top of OKKAMCORE, whereas Sect. V illustrates two usage scenarios we have addressed with the system. We conclude with a discussion of issues and an outlook on the further development of OKKAM in Sect. VI.

II. GOALS AND MOTIVATIONS

As soon as one starts thinking about the idea of an entity repository, the temptation of building what Craig Knoblock4 called an EntityBase in one of his recent talks is very strong. In short, an EntityBase can be thought of as an entity-centric knowledge base, where knowledge is organized around entities instead of schemas (e.g. relational schemas or even ontologies). In such an approach, any entity type would be characterized by a collection of attributes (for example, for entities of type book, some attributes can be “author”, “title”, “date of publication”, “publisher”), whose semantics is known in advance and explicitly specified.

We called this a temptation, as it is extremely appealing (we would always know what we know about an entity), but also very dangerous, as it presupposes a commitment on the meaning of an attribute which cannot be guaranteed in most practical situations by a repository which aims at being open, extensible, and global. Therefore, an important requirement for our service is that it is light and fast, and it should not be confused with yet another attempt in the direction of CYC [3] or SUMO [4]: systems of this type offered useful approaches in certain areas, but have obviously not contributed to a solution of the identity problem in the Semantic Web. What we are aiming to provide is a naming service for entities and a directory containing entity profiles, not a knowledge base.

An Entity Profile stores untyped data about entities which will support the human user or an application using the OKKAM API to process descriptions of entities, and in effect enable them to assess whether the entity they want to store knowledge about in their own local KB already has a URI in OKKAM, or whether they have to create a new one. We store untyped data because typing an entity’s attributes would require us to classify the entity, which would be in contrast with the abovementioned goal. We do not discriminate types of entities, because we explicitly want to be able to provide naming and descriptions for any entity.

Of course, at first sight one could think about what types of entities would be described in OKKAM, such as persons, artifacts, locations, companies etc.; this could make it appear sensible to provide a basic set of typed attributes for these entities. But we envision the system also providing support for less obvious applications such as Named Entity Recognition from the field of Natural Language Processing, which we will talk about later, in Section V. In these applications entities might represent a location or a piece of text in a document, a document itself or a collection of documents, and we end up with an unlimited set of potential types of entries, which makes it impossible to provide a common set of typed attributes for them. Therefore it is our opinion that only untyped descriptive metadata can provide for the envisioned level of generality.

Fig. 1. Schematic overview of OKKAM, plus external K/I sources

In addition to these untyped data, OKKAM provides for the management of what we call ontology references in the Entity Profile, i.e. a set of URIs to external sources that are known to store information or knowledge about these entities, as illustrated in Fig. 1. One of the reasons to go in this direction was the motivation to make OKKAM provide a possible solution to integration issues in the Semantic Web. While a great amount of work has been performed on schema-level information integration5, the aspect of entity-level information and knowledge integration still offers many opportunities for providing interesting approaches.

3 The system is named after Occam, a medieval philosopher whose main principle – known as “Occam’s razor” – was: entities should not be multiplied beyond necessity (in Latin: entia non sunt multiplicanda praeter necessitatem).

4 Craig Knoblock’s homepage: http://www.isi.edu/~knoblock/. Unfortunately, at this point no citeable publications about this topic are available.

5 It is hardly possible to cite all related work in this field. Specific to the area of the Semantic Web, the reader is referred e.g. to the publication list on http://www.ontologymatching.org for a host of publications, or the Ontology Alignment Evaluation Initiative (http://oaei.ontologymatching.org/) which performs an alignment contest. For an overview more related to the database world, we refer e.g. to [6].

One possible application
we envision to support with OKKAM is an extension-based equivalence check for classes in an alignment or integration process. Currently, in a schema-level integration process without an extension check, classes can be estimated to be equivalent, but without an extension check the result of this estimation cannot be proved. Additionally, without a service that provides strong decision support about whether two individuals with the same name are actually identical or not (which is the current situation in the Semantic Web), an extension check will hardly deliver very reliable results. With the help of Entity Profiles in OKKAM we hope to improve this situation: if, for two assumedly equivalent classes, the sets of OKKAM-registered individuals associated with them turn out to be identical, we have a very strong reason to support the equivalence assumption.

The last component of the Entity Profile is a set of assertions of identity between entities. We provide these for the case where two entities with different URIs in OKKAM are later discovered to describe the exact same object, and are thus identical. One possible criticism at this point is certainly the question of how we can know and be certain about the identity of entities. The answer is that we cannot. OKKAM will suffer from the same garbage-in-garbage-out property as any other information system. But with OKKAM at least we can provide a means for the Semantic Web to store and represent such information, and we hope that by consistent use of OKKAM in Semantic Web applications we can strongly improve the current situation by enabling agents to gain a certain level of confidence that they are actually “talking about” the same objects.

III. OKKAMCORE: CHARACTERIZING ENTITIES

A. Data Model and API

Fig. 2. OKKAMCORE’s Entity Relationship Diagram. (Diagram not reproduced; it relates Entities to Entity Identifiers, Preferred Identifiers, Alternative Identifiers, Labels with prefix and value, Ontology References, Assertions of Identity, and WordNet synset identifiers carrying a wnVersion attribute.)

The OKKAMCORE application manages data describing entities and assertions of identity between entities. The data structures we are using to model this are shown in the Entity Relationship Diagram presented in Figure 2. An Entity is in a biunivocal relation with an Entity Identifier, which is created by the system and represents the URI of the entity in OKKAM with which users can identify entities within their application or KB. Each Entity may have a Preferred Identifier, provided by the user who creates the entity, e.g. to mirror an identifier used in their information system; the relation among Entities and Preferred Identifiers is labeled as a key because Entities cannot share Preferred Identifiers. Each Entity may have any number of Alternative Identifiers; similarly to the Preferred Identifiers, every Alternative Identifier can belong to only one Entity at a time. To keep the diagram simple, it does not address the fact that there cannot be overlap among the different types of identifiers. All the identifiers are names for the Entities on which they are set: thus all the identifiers set on a given Entity are synonyms. Since different Entities have different names, each identifier can appear on at most one Entity, no matter if it acts as an Entity Identifier, a Preferred Identifier or an Alternative Identifier.

Entities can have any number of Labels set on them. Each Label has a prefix and a value; a Label’s prefix may be left empty. Different Labels with the same prefix and value can only belong to different Entities; as illustrated in Figure 2, the triple consisting of the Label’s prefix, the Label’s value and the Entity on which the Label is set forms a key. Each Ontology Reference has a value; any number of Ontology References can be set on an Entity. Similarly to the Labels, Ontology References with an equivalent value can only belong to different entities.

The Assertions of Identity are uniquely identified by their Assertion of Identity Identifiers. Each Assertion of Identity involves exactly two Entities. Different Assertions of Identity cannot involve the same two Entities.

OKKAMCORE provides its users with functionalities to manage and retrieve entities and assertions of identity. From a programmatical point of view, two APIs are provided:

• Publication API: enables publishing, modifying and removing of entities and assertions of identity;
• Inquire API: allows retrieval of entities matching a given set of criteria.

Both APIs offer straightforward functionalities that one would expect in such a system; for the sake of brevity we will not describe the full API in this article.

B. Architecture

The currently available implementation of OKKAMCORE is built on top of the J2EE 1.4 platform. The OKKAMCORE application is an Enterprise Application exposing the Publication and Inquire APIs as Web Services. Its architecture, as presented in Figure 3, follows the classical three-tier model that subdivides the Presentation Logic (the Web Services), the Business Logic (carried out by an EJB module), and the Data Persistence logic. The Web Services framework we adopted is Apache Axis26.

6 Apache Axis 2 Home: http://ws.apache.org/axis2/

The Data Persistence layer is subdivided into
different modules: i) handling the configurations of the application, ii) marshalling the internal representation of the data to external form, such as XML documents, and iii) communication with the database. Thus, implementations exploiting different database technologies may be plugged in effortlessly. For this reason, during the development of the OKKAMCORE application, we implemented two different Data Persistence modules: one based on a native XML database, and another one built on top of a relational database.

At first we developed the XML database based backend because it allowed us to have a first prototype of OKKAMCORE running in a very short time. The backend is based on the Open Source database eXist7. Although the flexibility of the native XML database together with the expressiveness of XQuery enabled us to complete the backend relatively quickly, we experienced scalability issues. It turned out that the number of entities that can be managed by this backend ranges in the tens of thousands, which is way below our desired goal. Although the XML database based backend performed well for testing of the rest of the application, and made the OKKAMCORE application promptly available for tests with the other services built on top of it, we decided to abandon this approach in favor of a relational database based backend.

Fig. 3. OKKAMCORE Application Architecture

7 eXist Home: http://exist.sourceforge.net/

IV. SERVICES

OKKAM can be viewed as a collection of services built on top of OKKAMCORE. In this section we list the main services which – in our opinion – should belong to OKKAM, describe their implementation (if available), or present ideas on how they could be implemented.

A. Population of OKKAMCORE

The first important service is the one that supports the population of OKKAMCORE with new entities. In fact, there are many issues that must be addressed and solved before new data is allowed to be stored. In particular, we want to stress the following:

• first of all, we want to add a new entity only if it is not already stored in OKKAMCORE. But this means that we need smart ways of recognizing whether a new candidate entity is already stored, and of deciding when a new entity which looks like an entity already stored actually is a new one. These two requirements are crucial: failing to meet the first would lead to a lack of completeness (failing to support inferences which in theory are sound, based on the fact that two names refer to the same entity); failing to meet the second would lead to a lack of correctness (false conclusions would be supported, based on the fact that two different entities have been collapsed onto a single identifier);
• imagine we detect that an entity is already stored, and that we find a new occurrence of that entity in a document where some information about it is provided. Question: what if the new information conflicts with the old one? And, even before that, how do we detect that there is such an inconsistency?
• as will be clarified in the section on envisaged application scenarios, information may be imported into OKKAMCORE from very different sources, including humans (who may be carefully making data entry), ad-hoc wrappers designed to import entities from rich sources (e.g. lists of entities from Wikipedia), and entity recognition tools (which may be extracting entity descriptions from free text). These potential sources may provide very uneven data, including a lot of garbage, which would undermine the role of OKKAM as a general and reliable tool;
• finally, a theoretical issue which needs to be addressed is the following: what counts as an entity? There is little doubt that people, organizations, cars, computer files, and electronic devices are entities. But, for example, is a document an entity? Is it an abstract entity, or is it identified with its physical realizations? If so, is every copy of a document a different entity? Another example is: are logical resources (like concepts, relations, topics) entities? Or is the entity the linguistic expression used to express a concept? But then are two linguistic formulations of the same concept different entities? And furthermore: are fictitious entities entities? Should we allow Pegasus and Spider Man to sneak into OKKAMCORE? And the list can be made much longer8. . .

8 We note that a very practical version of these philosophical questions is the following: what should be represented as an instance in an OWL ontology? And what as a class/property? The issue is tricky, and we make only one example: should “Pizza Margherita” be a class or an instance of an Italian food ontology? If we check e.g. [5], we find that the answer to this type of question can be quite disappointing.

To address these issues, we have developed the following components:

• OkkamListsManager
On the WWW there are many lists of entities, and these are thus
a potentially important resource for OKKAM. For example, Wikipedia provides lists of countries, cities, and members of particular domains (e.g. Presidents of the United States, Computer Scientists, etc.) that are exactly the types of entities that we want to store in the system. With the objective of finding a standard mechanism for integrating these entities into Okkam, we developed a language (an XML Schema) that describes the input that a data source has to follow to communicate with the population process of OKKAM. The main elements of the schema follow the internal structure of Okkam; in fact we have elements like ”Labels”, ”Label-prefix” or ”Label-value” that are easy to map to the table elements of OkkamCore. This language is used by different wrappers that we developed and that try to convert the structure of a source list into the OKKAM input standard. For lists from the Web (Wikipedia, Yahoo, Google, etc.) the main purpose of the wrappers is the cleansing of the data from HTML tags. After this step the entity collection is normalized with the objective of deleting duplicates. Entities with the same annotation label are recognized by the system, and the OKKAM administrative user can check whether there exist conflicts from members of the list that are the same entity (from a logical point of view). During insertion, for each entity the system searches OKKAM for an already existing entity with the same label(s). If one exists, the new entity is “frozen” and included in a set of entities that should be checked by the administrator before addition to the system; otherwise it is added immediately.
• OkkamDBManager
Another important information source for OKKAM can be generic databases, as far as we have access to them. Examples might include direct database access to information systems such as extranets, online shops or publishing houses. In this case the transformation from the internal structure of the tables into the OKKAM input language is easier, because the main objective of the process is writing queries that build the link between the database structure and the Okkam data structure. When the transformation into the input language is completed, the rows that come from the database follow the same process that we already described for web lists. With database sources the role of the user becomes more important because, with high numbers of entities, duplication and redundancy are an increasing problem.
• OkkamManualEntry
Another way we provide to insert new entities is the manual case. A Web interface provides easy access to the insert function. The user can add new entities, with labels, ontology references, etc., to the system using a form to specify all the information that he/she wants to describe the new entity with. As in the previous cases, if the system finds a possible conflict with entities that are already in Okkam, it issues a warning message that informs the user of the possible error. This methodology of insertion is the slowest that Okkam provides, but it is the most precise and complete, because the user can provide information that the system cannot automatically discover, and can optimize the input in a feedback loop.
• Protege Plugin
We provide a plugin for the ontology editor Protege, which we describe in further detail in Sect. V-A.

B. Searching for URIs

Another critical service is searching for the identifier of some entity which is known either by some description (e.g. a name, for people) or by an identifier which was not issued by OKKAM. This site should be kept very simple, like a traditional search engine, and based on an easy mechanism to visualize results. In a standard use case the user types in a keyword associated with an entity, and the system searches the repository for instances that match this label. For example, if a user searches for entities that have the label ”Heiko Stoermer”, the OKKAM Management System will search the database for the instances that have this keyword and will return the main information about them. The main data will be the URI of the entity, the other labels associated with the entity (in our example these can be ”H.Stoermer”, ”Stoermer”, ”Mr. Heiko Stoermer”) and the classes of the ontologies where the entity is used. In our example we have different classes such as ”Person” or ”PhD Student”. The information about the classes where other persons use the entity, and its URI, are very important, because with this data the user can choose which entity is the correct URI in OKKAM.

If we have two URIs that share the same label ”Roma”, but one is attached to a class like ”City” or ”Capital” and the other refers to ”Person” or ”Customer” or ”Employee”, the user can easily understand whether he needs the first URI, because he wants to speak about the capital of Italy, or the second one, because he refers to a person named ”Roma”. The filtering process, with information about classes, can be performed before the search step: the submitted query can be a pair ”keyword - class”. This means that the system will return only the URIs that fulfil both terms of the interrogation. This first use case, the Web OKKAM site, is very simple and understandable, because the most difficult process, the filtering task, is delegated to the user responsible for this operation.

The Web site of OKKAM is not the only application built on the URI database. There are many situations where it is very difficult to believe that users would use the web site to search for the URIs of the resources that they need. For example, if we have a large database with all employees of an organization, it is impossible that the designers and developers wanting to build a semantic application on this data would search the OKKAM web site for all the URIs of the persons stored in their database. This process can be simplified if they can use an automatic service, in this case a web service, that provides an access point to OKKAM that an application can use. The developers can build an application that extracts the data from their database and sends it to the web service, which will return some results (URIs) about the information that is already stored in OKKAM.
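The ”keyword - class” search described in Sect. IV-B can be sketched in a few lines of Java. This is an illustrative sketch only, not the actual OKKAM Inquire API: the names `OkkamSearch` and `EntityRecord`, and the in-memory list standing in for the repository, are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "keyword - class" search from Sect. IV-B.
// EntityRecord is an illustrative stand-in for an Entity Profile entry.
public class OkkamSearch {

    public static class EntityRecord {
        public final String uri;            // the OKKAM URI of the entity
        public final List<String> labels;   // labels set on the entity
        public final List<String> classes;  // classes of ontologies where the entity is used

        public EntityRecord(String uri, List<String> labels, List<String> classes) {
            this.uri = uri;
            this.labels = labels;
            this.classes = classes;
        }
    }

    // Return the URIs of all entities carrying the given label; when
    // ontologyClass is non-null, keep only entities used in that class,
    // i.e. the query is the pair "keyword - class".
    public static List<String> search(List<EntityRecord> repo,
                                      String keyword, String ontologyClass) {
        List<String> result = new ArrayList<>();
        for (EntityRecord e : repo) {
            if (!e.labels.contains(keyword)) continue;
            if (ontologyClass != null && !e.classes.contains(ontologyClass)) continue;
            result.add(e.uri);
        }
        return result;
    }
}
```

With two entities both labeled ”Roma” – one in classes ”City”/”Capital”, one in ”Person”/”Customer” – a plain keyword search returns both URIs, while the pair (”Roma”, ”City”) narrows the result to the capital, which is exactly the pre-search filtering discussed above.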
V. TWO USAGE SCENARIOS

A. Runtime support for ontology editing

Another important area for which OKKAM has to provide services and applications is existing Semantic Web tools. In particular, ontology editors are applications where users build a formalization of part of the world by means of classes and instances of these classes, all identified by URIs. One of the most widely used and important editors is Protege, an open source product that can be extended and modified with ”plug-ins” added on top of the core system. For the OKKAM vision it is of high importance to develop a plug-in for this application which provides a connection with the URI database when users create new instances of a class, which we are doing as illustrated in Fig. 5. If a user creates a new instance of a class, instead of assigning an arbitrary, meaningless number as ID, the plug-in will search the repository for an already existing URI that can be assigned to this new instance. The selection process is envisioned to be similar to the web search use case, where a list of URIs that match the label for the new instance is visualized to the user.

Important support for all the selection processes comes from additional tools, such as WordNet, that provide information about the meaning of the classes used in the ontologies where the new instances are created. With this information the system has more data to try to recognize the correct URI to return to the users or applications that query OKKAM.

B. Supporting Knowledge Extraction and Representation

One of the scenarios we are currently implementing with the help of OKKAM is to support Knowledge Extraction (KE) processes and the resulting Knowledge Representation (KR) in a Semantic Web project9 that aims at building a large-scale Knowledge Base (KB) from information stored in distributed document bases. The architecture comprises a pipeline of processes that covers all steps from KE to the building of the KB (the so-called Semantic Resource Network) for end-user services, as illustrated in Fig. 5.

Within the pipeline there are several points of application imaginable, two of which we have currently implemented and further describe here:

• Information Extraction: Named Entity Recognition and Coreference
Whenever our NLP process recognizes a named entity in a piece of text, it interacts with OKKAM to analyze whether this named entity already has a unique URI. If yes, the NLP process stores locally10 the fact that a uniquely identified entity has been discovered, with additional information such as its location in the document, etc. If the entity does not have a URI yet, an Entity Profile is created in OKKAM and the resulting URI is used accordingly. For subsequent discoveries of the same named entity, the same URI will be used to indicate that the two discovered entities are in fact the same, just in different locations of the document. This approach is equally applicable to discovered coreferences11. If the NLP process updates the Entity Profiles in OKKAM correctly, we gain direct access to search situations of the type “show me all documents that talk about this entity”, as the respective links would be stored as Ontology References which we can evaluate and reason about with a higher-level service.
• Refinement: Identity Discovery
In the refinement phase, as depicted in Fig. 5, we can address shortcomings of the NLP processes in terms of the discovery of identity. The VIKEF pipeline has dedicated a whole processing step to this issue, as – at the named entity extraction level – it is not always possible to detect identity between entities. Obvious examples in this case are missing correspondences between orthographic variations, hinted at already in Sect. IV-B, e.g. the fact that within one document there is a certain probability that the strings “Stoermer”, “H. Stoermer” and “Heiko Stoermer” denote the exact same individual. With the support of the OKKAM system, we have implemented several heuristics to address this issue, the simplest performing a substring query to OKKAM and using a string similarity measure on the results to choose candidates for establishing an assertion of identity between them, and thus to cluster annotations. A higher-level process is free to either choose one single URI for all the annotated entities or to retain the original URIs, as it is always possible to perform clustering via analysis of identity assertions in OKKAM.

VI. DISCUSSION AND CONCLUSION

OKKAM is the typical example of an application which is not based on some radically new scientific result, but aims at filling a gap by using existing technologies in a new way. In our opinion, without OKKAM (or a similar service), most Semantic Web promises will never be kept, as it provides a sort of bottom level for integration which cannot be achieved ex post when the ball stops. However, the fact that the basic technologies are already available should not lead us to underestimate the critical factors which may affect the success and adoption of OKKAM. In addition to aspects already discussed throughout the paper, we identify acceptance issues in the form that not every party involved in the Semantic Web may be willing to use a centrally managed service that is outside of their control. Privacy issues include all the well-known aspects of data security, access management, privacy etc. that almost all public information systems share. Last, but not least, there are of course questions of offered features and

9 See http://www.vikef.net for further information about the VIKEF project.

10 In fact, the annotations created in this phase are stored in an XML file,

11 A coreference is a linguistic pattern typically involving pronouns when talking about an object that has previously been named. Example: “Peter is a
which is later refined and then used as a base for the generation of RDF good runner. He does 10k in 45 minutes.” The personal pronoun he establishes
annotations that will be fed into a large knowledge base. the coreference in this case.
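The lookup-or-create interaction of the NER step can be sketched as follows. This is a minimal illustration only: the `EntityRepository` interface, the URI scheme, and the annotation record are our own assumptions, not the actual OkkaM API.

```python
import itertools


class EntityRepository:
    """Illustrative stand-in for the OkkaM repository (label -> URI)."""

    def __init__(self):
        self._by_label = {}
        self._ids = itertools.count(1)

    def lookup(self, label):
        # Return the URI of an already-registered entity, or None.
        return self._by_label.get(label)

    def create(self, label):
        # Register a new entity and hand out a fresh URI.
        uri = f"http://okkam.example.org/entity/{next(self._ids)}"
        self._by_label[label] = uri
        return uri


def annotate_entity(repo, label, offset):
    """Reuse the URI of a known named entity, or register a new one,
    and record the annotation locally (in the paper: an XML file that
    is later refined into RDF) together with its document position."""
    uri = repo.lookup(label) or repo.create(label)
    return {"label": label, "uri": uri, "offset": offset}


repo = EntityRepository()
first = annotate_entity(repo, "Heiko Stoermer", 42)
second = annotate_entity(repo, "Heiko Stoermer", 1138)
print(first["uri"] == second["uri"])  # same entity, same URI -> True
```

The point of the sketch is that repeated mentions of the same label resolve to one URI, which is exactly the precondition for merging the resulting RDF annotations later in the pipeline.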
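The simplest heuristic mentioned above — a substring query to OkkaM followed by a string-similarity filter over the results — can be sketched as below. The `OkkamClient` class, the choice of the longest token as the query key, and the threshold value are all illustrative assumptions rather than the real OkkaM interface.

```python
from difflib import SequenceMatcher


class OkkamClient:
    """Illustrative stand-in for the OkkaM repository: maps URIs to labels."""

    def __init__(self, entities):
        self.entities = entities  # uri -> label

    def substring_query(self, fragment):
        # Return (uri, label) pairs whose label contains the fragment
        # (or vice versa), case-insensitively.
        frag = fragment.lower()
        return [(uri, label) for uri, label in self.entities.items()
                if frag in label.lower() or label.lower() in frag]


def similarity(a, b):
    # Normalized edit-based similarity in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def identity_candidates(okkam, mention, threshold=0.6):
    """Query OkkaM with the longest token of the mention, then keep only
    results similar enough to be candidates for an identity assertion."""
    key = max(mention.split(), key=len)
    hits = okkam.substring_query(key)
    return [(uri, label) for uri, label in hits
            if similarity(mention, label) >= threshold]


okkam = OkkamClient({
    "urn:okkam:1": "Heiko Stoermer",
    "urn:okkam:2": "Paolo Bouquet",
})
print(identity_candidates(okkam, "H. Stoermer"))
# -> [('urn:okkam:1', 'Heiko Stoermer')]
```

A higher-level process would then assert identity between the mention and the surviving candidates, clustering the corresponding annotations as described in the text.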
Fig. 4. A Protégé plugin for generating individuals registered in OkkaM.
Fig. 5. Knowledge pipeline to be supported by OkkaM.
functionality, such as a really efficient and intelligent search and ranking mechanism for Entity Profiles in OkkaM, as well as performance and scalability issues, which are again common to most information systems. Our planned next steps are to address exactly these issues in the form of further research and by developing additional services on top of OkkamCore.

We conclude with the statement that currently, when creating ontologies, people actually perform two different tasks: they specify a conceptualization, and then “populate” such a conceptualization with instances by assigning instances to some class and specifying the values for properties (if any). It is a trivial observation that the same domain (set of entities) may be used to populate different ontologies (e.g. we may have two different conceptualizations of Italian wines & food), and that any two ontologies (e.g. an ontology about Semantic Web researchers and another about people living in Italy) may have overlapping domains. Creating a conceptual schema and then populating it with instances address two different issues: the first is an epistemological issue (it has to do with knowledge about the world), the second is an ontological issue (it has to do with existence).

From a design perspective, what we propose is to keep these two tasks separated: on the one hand, we need a universal and non-ambiguous way to refer to the entities about which an agent may have some knowledge; on the other hand, we need a way to specify knowledge about these entities. We believe that with the help of OkkaM this goal can be achieved more cleanly for the Semantic Web, as to the existing methods of specifying knowledge in the form of ontologies and knowledge bases we add an identity and reference architecture with a central character that enables systems and agents to ensure that they “talk” and store knowledge about the same entities, if these objects share the same identifier.

VII. ACKNOWLEDGMENTS

This research was partially funded by the European Commission under the 6th Framework Programme IST Integrated Project VIKEF - Virtual Information and Knowledge Environment Framework (Contract no. 507173, Priority 2.3.1.7 Semantic-based Knowledge Systems; more information at http://www.vikef.net).