
When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web

Harry Halpin
Institute for Communicating and Collaborative Systems, University of Edinburgh, 2 Buccleuch Place, Edinburgh, United Kingdom
H.Halpin@ed.ac.uk

Patrick J. Hayes
Institute for Human and Machine Cognition, 40 South Alcaniz St., Pensacola, USA
phayes@ihmc.us

2010

In Linked Data, the use of owl:sameAs is ubiquitous in 'inter-linking' data-sets. However, there is a lurking suspicion within the Linked Data community that this use of owl:sameAs may be somehow incorrect, in particular with regard to its interactions with inference. In fact, owl:sameAs can be considered just one type of 'identity link,' a link that declares two items to be identical in some fashion. After reviewing the definitions and history of the problem of identity in philosophy and knowledge representation, we outline four alternative readings of owl:sameAs, showing with examples how it is being (ab)used on the Web of data. We then present possible solutions to this problem by introducing alternative identity links that rely on named graphs.

Linked Data, ontology, resource, Web architecture

1. INTRODUCTION

As large numbers of independently developed data-sets have been introduced to the Web as Linked Data, the vexing problem of identity has returned with a vengeance to the Semantic Web. Because the ubiquitous owl:sameAs property is the RDF property used to connect these data-sets, this has been dubbed the owl:sameAs problem. However, the problem of identity lies neither within Linked Data nor within the Semantic Web languages; it is an outstanding and well-known, if sometimes not precisely labeled, issue in pre-Semantic Web knowledge representation languages in artificial intelligence. What is new in the latest guise of this problem on the Web of Linked Data is that, for the first time, the problem is being encountered by different individuals attempting to independently knit their knowledge representations together using the same standardized language. Much of the supposed “crisis” over the proliferation of owl:sameAs in Linked Data can be traced to the fact that these uses of owl:sameAs tend to be mutually incompatible, and almost always violate the rather strict logical semantics of identity demanded by owl:sameAs. However, the exact types of distinctions made by these individuals are important, even if they contradict the relevant specification of owl:sameAs. First, these uses and abuses of owl:sameAs demonstrate, for the first time in the history of knowledge representation, precisely how these problems play out in the wild. Second, as the Semantic Web is a project in development, it is always possible to specify new kinds of language constructs and more clearly specified best practices, so as to align the specifications with the actual empirical use of the Semantic Web in the wild.

First, we give an overview of the problem of identity and its somewhat dusty lineage in artificial intelligence, if only to show how what was already a known issue for knowledge representation is exacerbated when knowledge representation goes global for the Semantic Web. Then, four distinct uses of owl:sameAs are discussed in addition to the precise idea of “same thing as,” namely: Same Thing As But Referentially Opaque; Same Thing As But Different Context; Represents; and Very Similar To.

Finally, a number of suggestions for how the current situation can be improved are sketched. The necessity of both semantic and theoretical work is noted as well.

2. A BRIEF HISTORY OF IDENTITY

The problem of identity has a long and chequered history, spreading from philosophy and mathematics to linguistics and knowledge representation. In each of these fields, what it means for two things to be identical goes straight to the heart of semantics.

2.1 What is Identity?

The father of knowledge representation, Leibniz, was not surprisingly also the first to phrase a coherent and formalizable definition of identity, often called ‘Leibniz’s Law’ or ‘the Identity of Indiscernibles,’ namely that for every x and every y, if whatever is true of x is true of y, then x is identical to y [1]. This notion of identity states that identity is composed of properties, so that in order for two things to be the same they must share all the same properties. This law can be stated logically as ∀x∀y(∀P(P(x) ↔ P(y)) → x = y). The converse is uncontroversial: if x = y, then they are the same thing, so of course all the properties of x are also properties of y, as there is only one thing there to either have the property or not have it.

A number of classical problems already crop up in this analysis of identity. First, exactly what properties are being counted? We can easily imagine worlds where things have exactly the same properties but are nonetheless not identical, such as an exact clone (which is all too easy to produce with digital objects). There are two obvious escape hatches: a thing may have a property of “being itself” (a haecceity), or things having different spatio-temporal co-ordinates could be counted as different, even if they share the rest of their properties. While that sounds like a common-sense distinction, is it true that Tim Berners-Lee is the same person he was when he was a child? Or if he lost his arm? This leads straight into arguments about perdurance and endurance in philosophy. Are there two different kinds of properties, properties that are somehow intrinsic to identity and others that are extrinsic, i.e. purely relative to other things? Lastly, perhaps the real question is who or what determines the conditions of identity, namely that claims of identity can only be made in the context of an explicit theory of identity criteria. These theories can be formalized in terms of the semantic interpretation of sentences to referents. However, this does not mean that these theories are compatible. If one has a theory of identity where one is talking only about people as employees in a particular role, then two different people who have the same job will be identical, but under a more fine-grained interpretation the very same people would be different. One can even imagine theories of identity based on different criteria, where some theories of identity subsume weaker or stronger ones. There is also the problem of vagueness, and the inability to specify precise properties (such as the exact latitude and longitude boundaries of Mount Everest).

Regardless of the problems, the point of Leibniz’s Law is clear: when someone says two things are the same, they mean they share all the same properties.
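The dependence of identity on a chosen theory of identity criteria can be made concrete with a small sketch. This is only an illustration under stated assumptions, not a formalization given in this paper: the function name indiscernible, the two example people, and the property names are all hypothetical.

```python
from typing import Dict, Iterable

Properties = Dict[str, object]


def indiscernible(x: Properties, y: Properties, criteria: Iterable[str]) -> bool:
    """Return True when x and y agree on every property named in criteria.

    `criteria` plays the role of a theory of identity: it decides which
    properties are counted when comparing two things.
    """
    return all(x.get(p) == y.get(p) for p in criteria)


# Two hypothetical people who hold the same job.
alice = {"name": "Alice", "job": "librarian", "born": 1970}
bob = {"name": "Bob", "job": "librarian", "born": 1982}

# A coarse theory that only looks at the role, and a finer-grained one.
ROLE_ONLY = ["job"]
FINE_GRAINED = ["name", "job", "born"]

same_by_role = indiscernible(alice, bob, ROLE_ONLY)
same_fine_grained = indiscernible(alice, bob, FINE_GRAINED)
```

Under the role-only criteria the two people come out as identical, while under the finer-grained criteria they do not, which mirrors the employee example above.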

Frege was the first to note what Galois called the “linguistic counterpart” to Leibniz’s Law: the Principle of Substitution, which states that if x and y are identical, then x may be substituted for y without changing the truth-value of the sentence in which the substitution is made [6]. Formally, this is generally phrased in precisely the same manner as Leibniz’s Law. However, there is a subtle difference: while Leibniz’s Law deals with the identity of concepts and individuals on the level of properties, the Principle of Substitution deals with the use of the name itself in the context of actual sentences.

There are two important consequences. The first is the classic division between the use and the mention of a name. Even if “Mount Everest” in English and “Qomolangma” in Tibetan mean the same thing, names from different languages often cannot be substituted for each other. A sentence like “Qomolangma is a word in Tibetan” mentions a name, while the sentence “Mount Everest is the highest mountain in the world” uses the name. Obviously, “Mount Everest is a word in Tibetan” is false. The second concerns what is known about the referent under each name. Is “Qomolangma is the highest mountain in the world” true? This fact was not necessarily known in Tibet before the era of global geological surveys. So one could easily have a case of a geoscientist who has never visited the mountain knowing it is the highest mountain in the world, and a Tibetan monk who lives not too far from the mountain not knowing, or caring, whether it is the highest mountain in the world. This has been described as two names having the same referent but different senses, i.e. contexts that do or do not share certain information [7]. Often contexts where a name can be substituted are called extensional, while referentially opaque contexts where a name cannot be substituted are called intensional. In general, indirect quotations and statements of belief, such as “Rajendra Pachauri believes the glaciers on Mount Everest are melting,” are considered opaque. Although in practice the principle of substitution is subtle and its use often fraught with confusion, the key point is straightforward: a name can identify different things in different contexts.

2.2 The IS-A Debate in Semantic Networks

It would be easy to dismiss these arguments over identity as mandarin philosophical questions, until the pedal hits the metal in the world of knowledge representation. This is precisely what happened to semantic networks, a predecessor of the Semantic Web in knowledge representation. Semantic networks, as pioneered by Quillian, were viewed as an alternative knowledge representation scheme to first-order logic in the early days of artificial intelligence [12]. In essence, a semantic network is similar to an RDF graph, except that instead of using URIs, the nodes and edges were labeled purely using natural language or pseudo-natural language labels.

Semantic networks relied on words from natural language, or pseudo-words, to label their constructs. The meanings of these labels were somehow supposed to be simply obvious, but in practice this led to the constructs being ambiguous. The classic example was the infamous IS-A label analyzed by Brachman in his What IS-A is and isn’t: An Analysis of Taxonomic Links in Semantic Networks [3]. Often, two nodes were connected by an IS-A link. Were IS-A links assertional, such that two nodes connected by an IS-A link were somehow identical? Or were they taxonomic, such that they meant a sub-class or subset relationship? Or did they express a structural relationship between a concept and an instance? Brachman found that there was a proliferation of meanings of IS-A links in semantic networks, and that not only were these meanings incompatible between different semantic networks, but IS-A links were often given different meanings within a single network [3]. Given this lack of clarity about exactly what was being represented, semantic networks could not be transferred or combined with each other with any degree of reliability.

In an effort to remedy this crisis, Brachman and others decided to split what they called the “epistemological level,” the kinds of nodes and edges that remained neutral to the underlying primitives yet could be given a specific semantics, from the rest of the semantic network, whose meaning could only be grounded in some linguistic convention [4]. These logical constructs could be given a formal semantics (and thus a model theory) by mapping them to a language with a well-defined semantics, such as first-order predicate calculus. Therefore, semantic networks could be considered just an intuitive (or slightly odd, depending on your preferences) notation for logic.

The Semantic Web seems to have learned from semantic networks. The formal semantics of RDF are important precisely because RDF statements can be given the same logical meaning uniformly across a distributed network, even if the semantics of RDF have relatively light inferential power and do not constrain the semantic interpretations very tightly [8]. This is also on purpose, as it allows RDF to be used, in theory, as a foundation or “glue” for other more constrained vocabularies. Furthermore, it was precisely the explorations of the semantics of “semantic” networks that led to description logic, and so to OWL. By giving OWL and RDF a formal semantics, albeit a very limited one, it was imagined that the Semantic Web would not repeat the mistakes of semantic networks.

3. THE IDENTITY CRISIS OF LINKED DATA

Contrary to popular belief in some circles, formal semantics are not a silver bullet. Just because a construct in a knowledge representation language is prescribed a behavior using formal semantics does not necessarily mean that people will follow those semantics when actually using that language “in the wild.” This can be put down to a wide variety of reasons. In particular, the language may not provide the facilities needed by people as they actually try to encode knowledge, so they may use a construct that seems close enough to their desired one. A combination of not reading specifications (especially formal semantics, in which even most software developers and engineers lack training) and the labeling of constructs with “English-like” mnemonics will naturally lead actual users to use a knowledge representation language in ways that vary from what its designers intended. In decentralized systems like the Semantic Web, this problem is naturally exacerbated. However, far from being a sign of abuse, it is a sign of success, as it means that the Semantic Web is actually being deployed outside academia and research labs.

The problem has definitely arisen on the Semantic Web in terms of the use of owl:sameAs in Linked Data. In Linked Data, each item of interest is given a URI, which in turn redirects to either human-readable HTML or machine-readable RDF depending on content negotiation. The URI for the item itself identifies what is called, rather confusingly, a “non-information resource” in Linked Data circles, whereas a web-page or RDF graph would be an information resource, as the “distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message” [9]. Usually, this data is released in some sort of automated or semi-automated manner, often by mapping relational data to RDF. Somehow, a URI is chosen for each identifier in the data-set that is exported as Linked Data. Although the general thinking in RDF (and thus the main idea behind RDF graph merging) was that URIs would be re-used, in practice URIs are simply minted anew for each identifier in a Linked Data set. As opposed to the simple exporting of data-sets into RDF, what puts the links in Linked Data is the use of what we term identity links, links that define two things to be identical or otherwise closely related, to link between diverse and heterogeneous data-sets. While there has been some research that deals with this problem [10], the scope of the problem is just beginning to be understood.

The most typical link used is owl:sameAs, which is in general used to say “that two URI references actually refer to the same thing” [2]. For example, the city of Paris is referenced in a number of different Linked Data-sets, ranging from OpenCyc to the New York Times. In DBpedia, a Linked Data export of Wikipedia, these data-sets are connected by owl:sameAs. In particular, dbpedia:Paris is linked by owl:sameAs to both opencyc:CityOfParisFrance and opencyc:Paris DepartmentFrance. OpenCyc distinguishes the department of Paris from the city of Paris itself, as Paris DepartmentFrance is a distinct geopolitical entity from CityOfParisFrance despite the fact that both share the same territory, while Wikipedia does not make this distinction.

4. THE SEMANTICS OF OWL:SAMEAS AND ALTERNATIVES

At first, this use of owl:sameAs seems to be harmless. Its actual definition is that “the built-in OWL property owl:sameAs links an individual to an individual” and “Such an owl:sameAs statement indicates that two URI references actually refer to the same thing: the individuals have the same identity” [13]. Furthermore, OWL states that “It is unrealistic to assume everybody will use the same name to refer to individuals. That would require some grand design, which is contrary to the spirit of the web” [13].

However, owl:sameAs does have a particular semantics of individual identity, namely that the two individuals are exactly the same and so share all the same properties, and thus are equivalent in terms of Leibniz’s Identity of Indiscernibles. Given that OWL has no unique name assumption, once owl:sameAs is applied to two different URIs, any statement made about one URI is true of every other URI connected to it by an owl:sameAs link. Furthermore, while in OWL Full owl:sameAs can hold between any URIs, since classes can be treated as “individual” instances of other classes and properties can be treated as individuals, in OWL DL individuals are strictly separated from classes in order to preserve decidability, and so one should use owl:equivalentClass and owl:equivalentProperty instead. Therefore, quick-and-dirty use of owl:sameAs will almost always lead to OWL Full, for which very little work has been done on efficient implementations of inference. The real trick with owl:sameAs is that it works both ways: it is both symmetric and transitive, so anyone can link to your data-set with owl:sameAs from anywhere else on the Web without your permission, and any statement they make about their own URI will immediately apply to yours. As one can imagine, such transitive closures can quickly get very large. There have been considerable rumors in the Linked Data community that such use of owl:sameAs is somehow “wrong” with regard to the formal semantics of OWL. It does seem intuitively that the use of owl:sameAs may be the logical equivalent of a bulldozer. Since inference is rarely used on Linked Data, these possible side-effects have not been noticed. Does this really matter? Is the use of owl:sameAs an exploding time-bomb for Linked Data, or a harmless convention? What exactly is the point of linking data if nobody is going to draw any conclusions which use the links?
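To see how quickly symmetry and transitivity spread statements, the following minimal sketch computes owl:sameAs equivalence classes and then applies every statement to every co-referent URI. It is an illustration only, not an OWL reasoner: the union-find grouping is our choice of technique, the function names same_as_closure and smush are hypothetical, the URI strings are simplified stand-ins for the Paris example, and the ex:Department statement is invented for the demonstration.

```python
from collections import defaultdict


def same_as_closure(links):
    """Group URIs into equivalence classes under symmetric, transitive sameAs."""
    parent = {}

    def find(u):
        # Each URI starts as its own representative.
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving keeps lookups short
            u = parent[u]
        return u

    for a, b in links:
        parent[find(a)] = find(b)  # merge the two classes

    classes = defaultdict(set)
    for u in list(parent):
        classes[find(u)].add(u)
    return list(classes.values())


def smush(statements, classes):
    """Apply every (subject, property, value) statement to all co-referent URIs."""
    members = {u: cls for cls in classes for u in cls}
    inferred = set()
    for s, p, o in statements:
        for alias in members.get(s, {s}):
            inferred.add((alias, p, o))
    return inferred


# Illustrative stand-ins for the two identity links discussed above.
links = [
    ("dbpedia:Paris", "opencyc:CityOfParisFrance"),
    ("dbpedia:Paris", "opencyc:ParisDepartmentFrance"),
]
# A hypothetical statement made only about the department.
statements = [("opencyc:ParisDepartmentFrance", "rdf:type", "ex:Department")]

classes = same_as_closure(links)
inferred = smush(statements, classes)
```

Although nobody linked the city to the department directly, both end up in one equivalence class, so the statement typed against the department is also asserted of the city and of dbpedia:Paris.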

5. FOUR VARIATIONS OF IDENTITY IN LINKED DATA

5.1 Same Thing As But Referentially Opaque

The first case is when the two URIs do refer to the same thing, but not all the properties ascribed to one URI are necessarily accepted by the other. This means that the use of the URI is referentially opaque: one URI cannot be substituted for another (the Principle of Substitution is violated), i.e. the context is intensional. A classic example of this would be the concept of sodium in DBpedia, which has an owl:sameAs link to the concept of sodium in OpenCyc. The OpenCyc ontology says that an element is the set (class) of all pieces of the pure element, so that, for example, sodium in Cyc has a member which is a lump of pure metallic sodium. On the other hand, sodium as defined by DBpedia is also used to include isotopes, which have a different number of neutrons than “standard” sodium. So, one should not state the number of neutrons for DBpedia’s use of sodium, but one can for OpenCyc’s. Therefore, owl:sameAs here is in error, as it does not allow mutual substitutivity. Indeed, this use of URIs in an opaque referential context is likely what most uses of owl:sameAs actually are, as it is unlikely that most deployers of Linked Data actually check whether or not all the properties and their associated inferences are shared amongst linked data-sets. This property is exceedingly important for Linked Data, as contrary to popular doctrine, it is possible that the Web is full of referentially opaque contexts. The problem is that there is currently no way to get a handle on contexts informally without descending into non-logical reasoning.

5.2 Same Thing As But Different Context

In this case, two URIs do refer to the same thing and all properties do hold of both URIs, but we cannot re-use the URI in a different context. The central intuition here is that there are ‘forms of reference’ appropriate to a context, especially in social contexts. To use an example from Lynn Stein, when at a meeting of the PTA (Parent-Teacher Association) she is Ms. Stein, Rachel’s mum, not Professor Stein of MIT. This does not mean that in the PTA meeting Ms. Stein is somehow not a professor, but that within that context those properties do not matter. At first, this distinction may not seem directly relevant to Linked Data, provided we keep ‘name’ in the social sense distinct from ‘identifier’ in the Web sense. However, this distinction raises other issues about what kind of ‘names’ URIs really are, and precisely why certain properties are given in the RDF description of a certain URI and others are not.

5.3 Represents

Often identity is conflated with representation. While the term “representation” is often very contentious, its intuitive definition is that, just as a picture of something depicts that thing, a URI can identify a representation of a thing rather than the thing itself. Intuitively, there seems to be a clear-cut line between that which represents something (the representation) and that which is represented (the referent), sometimes called the relationship between a “sign” and a “signifier.” However, the relationship is often not as clear-cut as we would lead ourselves to believe. In fact, in human natural language use-mention confusions are ubiquitous and often useful. For example, a web-page or an e-mail address is often used to refer to a person. Rather than yell at the world to get an education in philosophical logic, it may be better to clarify this relationship. It also might be worth distinguishing between using a representation to refer to the represented, such as using a picture of Berners-Lee to refer to Tim Berners-Lee himself, and using something accidentally or contextually to refer to something, a phenomenon called displaced reference. The example of using an e-mail box to refer to a person is not an error but rather a displaced reference.

5.4 Very Similar To

Sometimes it is clear that two things are not identical but simply closely related in some manner. This, for example, is the relationship between the district of Paris and the Department of Paris in Cyc. Furthermore, there are often complex, structured, yet hard-to-specify relationships between things, such as the relationship between isotopes and elements, a quantity and a measurement of that quantity, and an image and a facsimile of that image. In web architecture, it is clear there is a close relationship These relationships that are ‘very similar to’ seem to deserve their own property, but are often currently lumped together in Linked Data under the all-encompassing use of owl:sameAs. Most of the more noticeable errors of owl:sameAs seem to come from this category, and it is likely that examples such as the relationship of sodium within DBpedia to sodium in OpenCyc are of this kind as well.

6. MOVING FORWARD

6.1 Same Thing As But Referentially Opaque

Surprisingly, most of the time people use owl:sameAs they are accidentally performing a sort of implicit import of the statements about the subject of the owl:sameAs statement. Obviously, to address the weaker identity implied by the referentially opaque use of identity, a weaker version of owl:sameAs should be specified that does not import all the properties in a full transitive closure. Somewhat similar predicates already exist in SKOS as skos:exactMatch and skos:closeMatch, but their use seems rare in Linked Data [11] and they require domains and ranges of SKOS concepts. As most Linked Data does not actually do much inference, one in reality only imports the statements that are actually used. So one could continue using owl:sameAs with a kind of ‘importer beware’ principle. Informally, it is one thing to link to your URI, but it is another thing to believe what you say about it as though you were talking about my URI. Put another way, one should be wary of accepting conclusions over here that could have been drawn over there, so to speak.
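One way to picture the ‘importer beware’ principle is as an explicit, selective import: the identity link alone brings nothing across, and the consumer decides which remote properties to accept. The sketch below is only an assumption about how such a policy might look, not a mechanism specified here. The function name import_statements, the URI strings, and the property name ex:neutronCount are hypothetical.

```python
def import_statements(local_uri, remote_uri, remote_statements, trusted_properties):
    """Copy onto local_uri only the explicitly trusted statements about remote_uri.

    Statements whose property is not listed in trusted_properties stay
    'over there' and are never asserted of the local URI.
    """
    return {
        (local_uri, p, o)
        for s, p, o in remote_statements
        if s == remote_uri and p in trusted_properties
    }


# Hypothetical statements published about the remote sodium URI.
remote_statements = [
    ("opencyc:Sodium", "rdfs:label", "sodium"),
    ("opencyc:Sodium", "ex:neutronCount", 12),
]

# The local publisher accepts the label but not the neutron count,
# since its notion of sodium also covers isotopes.
imported = import_statements(
    "dbpedia:Sodium", "opencyc:Sodium", remote_statements, {"rdfs:label"}
)
```

Here the link between the two sodium URIs remains useful for navigation, yet the neutron count is not carried over to the URI whose meaning includes isotopes.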

6.2 Same Thing As But Different Context

There is already a notion of context built into RDF, namely named graphs [5]. Even though named graphs are not part of the official standard (albeit snuck into RDF through SPARQL and implemented in almost every tool-set), it is clear that part of the problem with owl:sameAs usage on the Semantic Web is that sameAs should not always be a statement between two URIs in an unqualified manner, but may be qualified as holding only within a certain named graph. Furthermore, note that the use of owl:sameAs is somewhat equivalent to an accidental usage of owl:imports, although the exact behavior of this construct has only been intuitively (not formally) specified. These implicit imports should probably be separated, so that one first states that two items are identical using the weaker form of identity given above, and then independently, if one feels strongly that the two URIs are not referentially opaque, imports all (or even some of) the associated properties of the “identical” resource. Furthermore, there could be an inverse of this implicit importing of identity, where statements that would be imported due to the transitive closure of owl:sameAs are not imported. This would allow a more fine-grained measure of control over the use of identity in named graphs.
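Qualifying an identity link by named graph can be sketched by treating each statement as a quad (graph, subject, property, value) and letting a link affect only statements in its own graph. This is a simplified illustration under stated assumptions: it performs a single substitution step rather than a full transitive closure, the function name substitute_in_graph is hypothetical, and the graph names ex:tourism and ex:administration and the property ex:hasLandmark are invented for the example.

```python
def substitute_in_graph(quads, graph, identity_property="owl:sameAs"):
    """Apply identity links asserted in `graph` only to statements in `graph`."""
    # Keep just the triples that belong to the requested named graph.
    scoped = [(s, p, o) for g, s, p, o in quads if g == graph]

    # Collect the identity links of this graph and make them symmetric.
    links = {(s, o) for s, p, o in scoped if p == identity_property}
    links |= {(b, a) for a, b in links}

    inferred = set(scoped)
    for a, b in links:
        for s, p, o in scoped:
            if s == a and p != identity_property:
                inferred.add((b, p, o))  # one substitution step, not a closure
    return inferred


quads = [
    # In a hypothetical tourism graph the city and department are identified.
    ("ex:tourism", "opencyc:CityOfParisFrance", "owl:sameAs", "opencyc:ParisDepartmentFrance"),
    ("ex:tourism", "opencyc:CityOfParisFrance", "ex:hasLandmark", "ex:EiffelTower"),
    # In a hypothetical administrative graph they are kept apart.
    ("ex:administration", "opencyc:CityOfParisFrance", "ex:hasLandmark", "ex:EiffelTower"),
]

in_tourism = substitute_in_graph(quads, "ex:tourism")
in_administration = substitute_in_graph(quads, "ex:administration")
```

The landmark statement is transferred to the department only inside the graph that asserts the identity link; the other graph is untouched.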

6.3 Represents

There is already a sort of statement of this kind in the FOAF vocabulary, the foaf:isPrimaryTopicOf statement. One possible solution to this problem would be to wrap such a property into some core W3C-approved standard. However, the problem is that it is unclear whether a strict separation between mention and use is necessary or even desirable. In many contexts, as relevant experience in OpenID deployment shows, using an e-mail address as an identifier for a person is often more natural than using the URI of a home-page, or even a “non-information resource.” What is needed, however, is a way to make the distinctions that either conflate or separate mention and use on the fly. The use of weak identity statements (in this case, a “represents” statement) and explicit importing and de-importing of properties within the context of particular named graphs would allow us to state things like “Within this named graph and only within this named graph, the e-mail address URI is identical to the person and shares their properties” and “Within this other named graph, the e-mail address represents the person, but does not have all the properties of that person.”

6.4 Very Similar To

Again, the tempting easy solution is simply to introduce a new predicate for “very similar to.” The SKOS vocabulary has a number of “matching” predicates that are close in meaning to this, ranging from the hierarchically structured skos:broadMatch and skos:narrowMatch to the more suitable skos:closeMatch. However, the main issue with these predicates is that, again, their use may be a matter of opinion, as someone’s close match may be another person’s identical match. One is also tempted to engage with some sort of “fuzzy” or numerically weighted uncertainty measure of identity between zero and one, but then the hard questions of where precisely these real values would come from, and of their relationship to actual probability theory, quickly muddy these conceptual waters. It seems that beneath this apparently simple property is likely a whole family of heterogeneous and semi-structured identity relationships that should be studied more carefully and empirically observed before any hasty judgments are made.

7. CONCLUSION

It is possible to do empirical studies of exactly how people use owl:sameAs in the wild. Examples of owl:sameAs can be taken from the Linked Data Web in order to determine how experimentally robust these distinctions are, i.e. do people actually use owl:sameAs in the four ways that are outlined above, and are there more possible ways that we are not aware of? In fact, even the ability to recognize these kinds of distinctions may vary quite wildly by background and training. Lastly, if a number of empirical distinctions between identity links that are currently conflated by owl:sameAs can be made in a robust manner, then there is considerable formal semantic work to be done. Giving the Linked Data community well-defined (both formally and informally) predicates should be done even when one does not think of the properties given to URIs as absolute truths handed down by Linked Data publishers or W3C specifications, but as functions of their actual use. The (ab)use of owl:sameAs in Linked Data is not a threat; it is an opportunity.

[1] Alexander. The Leibniz-Clarke correspondence. Manchester University Press, Manchester, United Kingdom, 1956. Republished 1998.

[2] Bizer, Cyganiak, and Heath. How to publish Linked Data on the Web, 2007. http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/ (Last accessed on May 28th 2008).

[3] Brachman. What IS-A is and isn't: An analysis of taxonomic links in semantic networks. IEEE Computer, 16(10):30-36, 1983.

[4] Brachman and Schmolze. An overview of the KL-ONE knowledge representation system. Cognitive Science, 9(2):151-160, 1985.

[5] Carroll, Bizer, Hayes, and Stickler. Named graphs. Journal of Web Semantics, 4(3):247-267, 2005.

[6] Frege. Begriffsschrift, eine der arithmetischen nachgebildete Formelsprache des reinen Denkens. Halle, Germany, 1879.

[7] Frege. Über Sinn und Bedeutung. Zeitschrift für Philosophie und philosophische Kritik, 100:25-50, 1892. Reprinted in The Philosophical Writings of Gottlob Frege (1956), Blackwell, Oxford, United Kingdom, translated by Max Black.

[8] Hayes and Halpin. In defense of ambiguity. International Journal of Semantic Web and Information Systems, 4(3), 2008.

[9] Jacobs and Walsh. Architecture of the World Wide Web. Technical report, W3C, 2004. http://www.w3.org/TR/webarch/ (Last accessed Oct 12th 2008).

[10] Jaffri, Glaser, and I. Millard. Managing URI synonymity to enable consistent reference on the Semantic Web. In Proceedings of the Workshop on Identity, Reference, and the Web (IRSW) at ESWC2008, 2008.

[11] Miles and Bechhofer. SKOS Simple Knowledge Organization System reference. W3C Recommendation, W3C, 2008. http://www.w3.org/TR/skos-reference/.

[12] M. R. Quillian. Semantic memory. In M. Minsky, editor, Semantic Information Processing, pages 216-270. MIT Press, Cambridge, Massachusetts, USA, 1968.

[13] Welty, Smith, and D. McGuinness. OWL Web Ontology Language Guide. Recommendation, W3C, 2004. http://www.w3.org/TR/2004/REC-owl-guide-20040210/.