=Paper=
{{Paper
|id=None
|storemode=property
|title=Towards Identity in Linked Data 
|pdfUrl=https://ceur-ws.org/Vol-614/owled2010_submission_12.pdf
|volume=Vol-614
|dblpUrl=https://dblp.org/rec/conf/owled/McCuskerM10
}}
==Towards Identity in Linked Data ==
<pdf width="1500px">https://ceur-ws.org/Vol-614/owled2010_submission_12.pdf</pdf>
<pre>
              Towards Identity in Linked Data

                James P. McCusker and Deborah L. McGuinness

                           Tetherless World Constellation
                          Department of Computer Science
                          Rensselaer Polytechnic Institute
                        110 8th Street Troy, NY 12180, USA
                             {mccusj,dlm}@cs.rpi.edu
                                http://tw.rpi.edu


      Abstract. Many Linked Data applications have come to rely on owl:same-
      As for linking datasets. However, the current semantics for owl:sameAs
      assert that identity entails isomorphism, or that if a=b, then all state-
      ments of a and b are shared by both. This becomes problematic when
      dealing with provenance, context, and imperfect representations, all of
      which are endemic issues in Linked Data. Merging provenance can be
      problematic or even catastrophic in biomedical applications that demand
      access to provenance information. We use examples in biospecimen man-
      agement, experimental metadata representations, and personal identity
      in Friend-of-a-Friend (FOAF) to demonstrate some of the problems that
      can arise with the use of owl:sameAs. We also show that the existence of
      an isomorphic owl:sameAs can be inconsistent with current expectations
      in a number of our use cases. We present a solution that allows the extrac-
      tion of isomorphic statements without requiring their direct assertion.
      We also introduce a set of identity properties that can be extended for
      domain-specific purposes while maintaining clarity of definition within
      each property.

      Key words: owl:sameAs, identity, linked data, inferencing


1   Introduction
The adoption of Linked Data by many communities has resulted in a wealth of
useful data that can be combined in many ways. One current practice in Linked
Data, and therefore in many resources on the Semantic Web, is to link datasets
using owl:sameAs to define that two entities between those datasets are the
same. On its face, this is a good idea because a higher level best practice is to
re-use properties when possible and appropriate and owl:sameAs is a property
that seems easy to understand and is part of OWL [1], which is one of the
foundations of the semantic web and Linked Data.
    The property owl:sameAs is a very strict notion of identity. Its definition
stems from that of mathematical identity, most specifically, from the Indiscerni-
bility of Identicals [2], or a = b ∧ p(a, x) ⇒ p(b, x). This law is true for all true
statements about a and is the basis for isomorphism in owl:sameAs. Halpin and
2        Towards Identity in Linked Data

Hayes [3] identify four different uses of owl:sameAs in Linked Data: (1) Same
Thing As But Different Context, (2) Same Thing As But Referentially Opaque,
(3) Represents (4) Very Similar To. They question if the use of owl:sameAs in
Linked Data is truly a harmless convention. Some have argued that since Linked
Data applications rarely use inference, one should not worry about computa-
tional problems that would result from owl:sameAs inference. We argue that
at least some types of Linked Data applications (such as those shown in our
examples) do benefit from and actually require inference.
    We argue that one major issue with the use of owl:sameAs in Linked Data
and the Semantic Web is related to the inference of isomorphism. We present
some biomedical examples that help to describe some problems with owl:sameAs
and isomorphic inference. Finally, we discuss a set of possible identity properties
and show how the current uses outlined by Halpin and Hayes and other examples
that we describe fit into the new framework.


2      Problems with Identity in Linked Data
We present some examples that help to describe problems with owl:sameAs and
isomorphic inferencing. For instance, Jaffri et al. discusses a number of erroneous
mappings of owl:sameAs in DBpedia and DBLP [4]. Ding et al. [5] discusses a
number of issues around combining FOAF profiles using owl:sameAs. Finally,
Jain et al. [6] argues that more expressive semantics should be used to improve
the richness of Linked Open Data (LoD) Cloud.
    There are also issues with creating Linked Data while maintaining provenance
information. An example that we have found is in biomedical data. In Fig. 1, we
show a common derivation graph of a tumor T that was removed from a patient
A. A cell line was derived from the tumor which resulted in two specimens, LB
and LA, where LA was the original cell colony and LB was derived from it by
taking a part of LA and growing it on its own. There are datasets (D, E) that
use (LB, LA) respectively, and a scientist would like to integrate those datasets.
Through discussions with practicing scientists, we became aware of situations
where owl:sameAs is used to accomplish this. For example, the use of owl:sameAs
in Fig. 2 allows a scientist to integrate the datasets shown. However, isomorphism
is applied to all other properties of the specimens, resulting in ambiguous or
contradictory information. For instance, both specimens are inferred to have
been created on both 8/31 and 9/20, and have a quantity of both 5 and 10
grams. Also, specimen LA now seems to be derived from itself. It has been
noted that these properties could be set as annotations, which would mean that
they are not subject to isomorphism. However, most of the properties specified
here are first-class data as specified by biospecimen management systems and
would not be considered annotations by the originating system. There are other
issues with this sort of inference. It is now very difficult to address the following
potential problems that can arise:
    – The data doesn’t look right. What were the methods and protocols, and how
      consistent were they, going back to surgical resection?
                                            Towards Identity in Linked Data        3

 – Did the “same cell line” actually come from the same tumor, or just from
   the same patient? Or even different patients?
 – What originally seemed to be a primary breast cancer or lung cancer is now
   a metastasized melanoma. How do we sort this out?
 – Is a histology slide made from a tumor the same as the tumor? What about
   the tissue microarray, the cell culture, or the isolated molecular material?

None of these issues are problems that always come up, but when they do, it is
critical that the provenance of biospecimens remains distinguishable.
    Some have argued that these problems can be surmounted by putting the
offending statements in a named graph, or by attaching provenance statements
as OWL 2 annotations on individuals. These solutions may work for limited con-
texts, but there are issues with both. Relying on named graphs is problematic
because the statements may be embedded in other datasets which are already
enclosed in named graphs. The user would have to extract identity statements
from the original named graph and move them to another. Using annotations
would work well if the problem were limited to provenance, and provenance were
easily segregated into annotation properties. However, one data model’s prove-
nance is another’s data. Biospecimen management systems concern themselves
almost exclusively with what would be considered provenance to the experimen-
talist and analyst. It must be possible to realize ex post facto that large sections
of a data model are considered to be provenance by other users. As we will
show, this problem is not confined to provenance, but applies more generally to
statements of knowledge. Neither of these solutions address these issues.


Fig. 1. A scientist has datasets D and E, but D and E refer to different instances of
the same cell line.
4      Towards Identity in Linked Data


Fig. 2. As in Fig. 1, but with the assertion owl:sameAs(LA,LB), then D and E can
be integrated because they can refer to the same specimens. However, doing so means
that there are now multiple values for some important properties and LA appears to
have been derived from itself.


3   A Hypothesis About the use of owl:sameAs in the
    Semantic Web
There are some additional problems with using the indiscernibility of identicals
in knowledge representation discussed by Saul [7]. While the question of exactly
what knowledge is is an ongoing philosophical problem [8, 9], it is generally
agreed upon that for an entity to know a statement, it must first believe that
statement to be true. This of course doesn’t imply that the statement is true.
False beliefs are not by any means rare. Because of this, a number of issues that
apply to belief also apply to epistemology and knowledge representation. Since
the semantic web is concerned with knowledge and not truth values [10–12], these
issues also apply to the semantic web. Below, we discuss some problems described
by Kripke [13], Saul [7], and Pitt [14], but convert beliefs into assertions, which
are more applicable to a semantic web context. One classic problem concerns
secret identities:
 1. Lois Lane claims Superman can fly.
 2. Lois Lane claims Clark Kent cannot fly.
 3. Superman and Clark Kent are the same person.
We can then infer that Lois Lane claims that Superman cannot fly, and that
Clark Kent can. These statements in combination with the assertions generate
contradictions. Similar problems arise even without belief statements:
                                            Towards Identity in Linked Data      5

 4. Clark Kent is Superman’s secret identity.
 5. Therefore, Superman is Superman’s secret identity.
This is nonsensical, as a secret identity must be different from a public identity.
This is also true of changes in identity over time:
 6. I never made it to Constantinople, but I visited Istanbul last week.
 7. Istanbul was Constantinople.
 8. I never made it to Istanbul, but I visited Istanbul last week.
These examples all violate Leibniz’s law regarding the identity of indiscernibles
[2] in various ways, showing that even things that seem identical may not be
so. Another possibility is that a person’s knowledge of X is not an inherent
predicate of X. This is important to the semantic web since, as we discussed
before, all statements made in the semantic web are statements of knowledge,
not statements of truth. Thus, it can be problematic to use owl:sameAs (with
its isomorphic character) in all knowledge representation scenarios.


4     A New Model for Identity in the Semantic Web
The examples above indicate a need for an identity (or similar) notion that is
not inherently isomorphic. We also have a need to be able to find all entities
that are inferred (or stated) to be identical to a designated entity. The modeling
we need to do that is driven by these and other use cases, lead us to desire
more options related to the concept of identity. As mentioned above, Halpin and
Hayes [3] have identified four usages of owl:sameAs in Linked Data and we have
identified several more. We review owl:sameAs and decompose its properties into
Transitivity, Symmetry, and Reflexivity. We then discuss different permutations
of those properties and show how current usages of owl:sameAs fits into the new
framework. Finally, we will show how isomorphism can be selectively enabled
for particular properties using property annotations.

4.1    The Identity Ontology
The property owl:sameAs is, in addition to being isomorphic, Transitive, Sym-
metric, and Reflexive. Each permutation of those meta-properties can be viewed
as a new kind of identity. We have defined a new ontology called the Identity
Ontology (IO).1 The properties of the IO are shown in Table 1. Domain-specific
properties can be created as sub-properties of one of the eight IO properties
in order to maximize interoperability while maintaining distinctions among fu-
ture concepts of identity. We have also defined a mapping ontology that shows
examples of mappings with existing identity properties from RDFS, OWL, and
SKOS2 and show the subproperty relationship among the new and existing iden-
tity properties in Fig. 3.
1
    http://purl.org/twc/ontologies/identity.owl
2
    http://purl.org/twc/ontologies/identity-mapping.owl
6        Towards Identity in Linked Data

                                            Transitive       Intransitive
             Reflexive     Symmetric       id:identical       id:similar
                         Non-Symmetric id:claimsIdentical id:claimsSimilar
          Non-Reflexive Symmetric id:exactlyMatches           id:related
                         Non-Symmetric     id:matches     id:claimsRelated
Table 1. The proposed Identity Ontology. Eight new identity properties derived from
the original meta-properties of owl:sameAs: Reflexivity, Symmetry, and Transitivity.
The prefix “id” is used for the ontology.


Fig. 3. Subproperty relationships between the properties of the identity ontology and
existing identity properties from OWL, RDFS, and SKOS.


id:identical This is the most restrictive property of identity in IO. It follows the
    same definition as owl:sameAs, which “indicates” that two URI references
    actually refer to the same thing: the individuals have the same ‘identity’.”
    [1] As this is the most restrictive property, no IO identity properties are sub-
    properties of it. owl:sameAs is defined to be a subproperty so that existing
    valid assertions of identity are preserved.

The usages “Same Thing As But Different Context” and “Same Thing As But
Referentially Opaque” from Halpin and Hayes [3] fit neatly into id:identical.
“Same Thing As But Referentially Opaque” is effectively supported directly
by use of id:identical, and “Same Thing But Different Context” can be served
by implementation of a subproperty to aid in distinction. The examples using
Superman and Clark Kent can be considered to be in either class, and would
be served equally well. The example using Istanbul3 should be considered to
be “Same Thing But Different Context”, as the contextual distinction is the
existence of the Ottoman Empire. The FOAF examples also would benefit from
use of id:identical.
id:claimsIdentical Since this property is Transitive and Reflexive, but not
    Symmetric, it serves as a way for one entity to claim the identity of another,
3
    Not Constantinople.
                                           Towards Identity in Linked Data       7

    without the inverse needing to be true. As a super property of id:identical,
    everything that is actually identical makes the claim of identity, with both
    sides of the claim being made due to the symmetry of id:identical. This
    property is transitive because if entity a claims to be entity b and b claims
    to be entity c, then a cannot deny that it is claiming to be c as well.

The usage “Represents” can be supported using id:claimsIdentical using a sub-
property “representedBy”. Since id:claimsIdentical suggests that a can be re-
placed by b if claimsIdentical(a, b), then it can be said that b represents a, or
to fit into our usage more clearly, representedBy(a, b).

id:matches This property is reflexive and symmetric. It is inspired by skos:exact-
   Match, which “indicates a high degree of confidence that two concepts can
   be used interchangeably across a wide range of information retrieval appli-
   cations.” [15] id:matches is a super property of id:identical, because for all
   things that are identical, they also match.

For extremely strong assertions of “Very Similar To” from Halpin and Hayes
[3], id:matches can be used to assert identity because id:matches is intransi-
tive. Many current identities in Linked Data would be well supported using this
property.

id:claimsMatches This is the same as id:matches, but is not symmetric, so
    that entities can claim that they match things without reciprocation.

Weaker assertions of “Represents” can be supported using this method. It is
also useful for representing the relationship between a particular biospecimen
and the cell line that that represents it.

id:similar This is a statement of similarity without a guarantee of a complete
    match. Similarity is both Symmetric and Reflexive. Since things that match
    each other are also similar to each other, id:similar is a super property of
    id:matches. This is a super property of id:identical since everything that is
    identical is also similar. It is also a super property of skos:closeMatch [15].

This property can be best used to describe the fact that two biospecimens are
part of the same cell line. A notional subproperty, such as sameCellLineAs, would
allow for a domain-specific distinction of similarity that is understandable to
domain experts while still providing usefulness to more general-purpose systems.
Depending on the strength of “very” in “Very Similar To”, it can also support
the concept of identity for that usage from Halpin and Hayes [3].

id:claimsSimilar This is the same as id:claims but is not symmetric. Enti-
    ties can therefore use this property to claim similarity without recipro-
    cation. A statement of similarity is in actuality two claims of similarity,
    so id:claimsSimilar is a super property of id:similar. In symmetry with
    id:similar, claims of identity and matching imply a claim of similarity.
8        Towards Identity in Linked Data

This property is best seen in cases of asymmetric substitutions. For instance,
decongestant can be substituted by an antihistamine (and can be said to be
similar to it), but when someone has allergies, a decongestant will not relieve
the symptoms. Another example is that, in a pinch, one can use conditioner in
lieu of shaving cream, but the reverse does not hold.

id:related This asserts an associative link between two entities. As it is only
    symmetric, there are no claims to any sort of similarity, matching, or identity.
    Because of this, id:related is a super property of only id:matches, as id:similar
    and id:identical are reflexive, which would make id:related reflexive by proxy.
    This property is closely related to and is a super property of skos:related [15].

The idea of related entities is currently used in SKOS and in OBO (Open
Biomedical Ontologies).4

id:claimsRelated This is the loosest sense of identity in IO. It is a similar
    property to rdfs:seeAlso, which is “used to indicate a resource that might
    provide additional information about the subject resource.” [16] We de-
    fine rdfs:seeAlso to be a sub property of id:claimsRelated. id:related and
    id:claimsMatches are both super properties of id:claimsRelated.

An example subproperty of id:claimsRelated is a depiction. Since a photograph
or a illustration of a person or thing is not the thing itself, but a representation
of the thing, this is a kind of identity that is not symmetric (the photograph is
not depicted by the person), not transitive (a depiction of the depiction may not
depict the original subject), and not reflexive (does a person depict themselves?).
    These properties cover the wide range of identity relationships from “a is the
same thing as b” to “b has more information about a” and allow the expression of
precise concepts of identity while also leaving room for domain-specific concepts
as well.


4.2    Reconstructing Isomorphism

For any reflexive statement of identity, it is possible to recover isomorphic state-
ments using the following SPARQL query snippet:

select ?s, ?p, ?o
where {
    ?s id:identical ?x.
    ?x ?p ?o.
}

A major benefit of this formulation is that any property can be used in place
of id:identical and can be used for domain-specific concepts of identity. Addi-
tionally, property chains in OWL 2 [17] allow the definition of isomorphism for
4
    http://obofoundries.org
                                           Towards Identity in Linked Data        9

specific properties where that behavior is warranted. The specific pattern would
be:
         SubObjectP ropertyOf (ObjectP ropertyChain(identical, p), p)
It is therefore possible to construct properties that are isomorphic across specific
concepts of identity and allows users to query for values for any other prop-
erty that would have been isomorphic if the identity had been asserted using
owl:sameAs.


5   Discussion
We and others [3] have recognized the growing usage of OWL constructs such as
owl:sameAs. However, we also have observed unanticipated usages of owl:sameAs
where the existing semantics do not match the epistemological modeling needs.
This has led us to develop the Identity Ontology. We believe that IO provides
additional representational options for the notions of identity shown in our exam-
ples. We intend to continue our line of research into identifying, describing, and
using these different representational options. It is interesting to note that the
use cases are satisfied through application of existing OWL patterns of property
types and property chaining.
    The Identity Ontology is a starting point for developing a more nuanced
approach to identity in the semantic web. IO addresses numerous challenges in
our biomedical examples, and we have begun to use IO to represent concepts
of identity in biomedical datasets. Specifically, we integrated two experiments
from Array Express: E-TABM-65 and E-MEXP-1029. Both of these experiments
use the NCI-60, a panel of cell lines used for cancer research. We converted
the two experiments to RDF using MAGETAB2RDF5 and aligned the biolog-
ical sources using biomedidentity:sameAsBioSource.6 In that ontology, we also
make mage:has derivative,7 and mged:has biomaterial characteristics 8 isomor-
phic across our identity property using property chains. We plan to continue to
investigate the properties of identity in relation to constructs included in IO as
well as owl:sameAs.


6   Conclusions
We have elaborated on problems with isomorphism in the current use of owl:same-
As in Linked Data on the Semantic Web. We have provided more options for
representing identity using our Identity Ontology and have initial work support-
ing its usage in a number of use cases. We also show how isomorphic statements
can be queried, and how particular properties can be made to be isomorphic
using property chains in OWL 2. We have also successfully used the Identity
5
  http://magetab2rdf.googlecode.com
6
  http://espresso.med.yale.edu/ jpm78/tw/identity/biomedidentity.owl
7
  http://magetab2rdf.googlecode.com/svn/trunk/ontologies/mage-om.owl
8
  http://mged.sourceforge.net/ontologies/MGEDOntology.owl
10      Towards Identity in Linked Data

Ontology to enable more granular control over inference. We have found that
this additional control over inferred information is a better match to our biomed-
ical application needs than what we previously had access to using owl:sameAs
alone.


References
 1. Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D.L., Patel-
    Schneider, P.F., Stein, L.A.: OWL Web Ontology Language Reference (2004)
 2. Leibniz, G., Loemker, L.: Philosophical papers and letters. Springer (1976)
 3. Halpin, H., Hayes, P.J.: When owl:sameAs isn’t the Same: An Analysis of Identity
    Links on the Semantic Web. In: Workshop on Linked Data on the Web (LDOW).
    (2008)
 4. Jaffri, A., Glaser, H., Millard, I.: URI Disambiguation in the Context of Linked
    Data. In: Workshop on Linked Data on the Web (LDOW). (2008)
 5. Ding, L., Shinavier, J., Finin, T., McGuinness, D.L.: An Empirical Study of
    owl:sameAs Use in Linked Data. In: Web Science 2010. (2010)
 6. Jain, P., Hitzler, P., Yeh, P., Verma, K., Sheth, A.: Linked Data Is Merely More
    Data. Dan Brickley, Vinay K. Chaudhri, Harry Halpin, and Deborah McGuinness:
    Linked Data Meets Artificial Intelligence. Technical Report SS-10-07 (2010) 82–86
 7. Saul, J.: Substitution and simple sentences. Analysis 57(2) (1997) 102
 8. Armstrong, D.: Belief, truth and knowledge. Cambridge University Press London
    (1973)
 9. Gettier, E.: Is justified true belief knowledge? Analysis 23(6) (1963) 121
10. Berners-Lee, T., Hendler, J.: Scientific publishing on the semantic web. Nature
    410 (2001) 1023–1024
11. Heflin, J.: Towards the semantic web: knowledge representation in a dynamic,
    distributed environment. PhD thesis, University of Maryland, College Park (2001)
12. Davies, J., van Harmelen, F., Fensel, D.: Towards the semantic web: ontology-
    driven knowledge management. John Wiley & Sons, Inc. New York, NY, USA
    (2002)
13. Kripke, S.: A puzzle about belief. Meaning and Use (1979) 239–283
14. Pitt, D.: Alter Egos and Their Names. The Journal of Philosophy 98(10) (2001)
    531–552
15. Miles, A., Bechhofer, S.: SKOS Simple Knowledge Organization System Reference
    (2009)
16. Brickley, D., Guha, R.V.: RDF Vocabulary Description Language 1.0: RDF Schema
    (2004)
17. Motik, B., Patel-Schneider, P.F., Cuenca Grau, B.: OWL 2 Web Ontology Lan-
    guage: Direct Semantics (2009)

</pre>