<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Knowledge Diversity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fausto GIUNCHIGLIA</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mattia FUMAGALLI</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering and Computer Science (DISI) University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this short paper we introduce two main elements of a general methodology, called iTelos, for the management of knowledge diversity. The first is Knowledge Lotuses, a general-purpose tool for the representation of knowledge diversity; the second is a set of metrics that quantify this diversity as it occurs within and across knowledge resources. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Footnote 1: https://lov.linkeddata.es/dataset/lov, http://lov4iot.appspot.com/.</p>
      </abstract>
      <kwd-group>
        <kwd>knowledge diversity</kwd>
        <kwd>knowledge representation</kwd>
        <kwd>context</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Semantic Heterogeneity is the phenomenon which arises when
multiple knowledge resources for the same domain, e.g., ontologies or schemas, present
differences in how the intended meaning is represented, most often because
they have been developed independently. Managing this phenomenon is
crucial to enabling the Semantic Interoperability of knowledge resources. The key
intuition underlying most previous work is to reduce the input representations to a given
reference representation, e.g., an ontology, while preserving the intended meaning. This
problem has been extensively studied in the literature, leading to a substantial body of
results. As an example, LOV, LOV4IoT,<sup>1</sup> three among the most relevant repositories of
reference knowledge resources, collectively contain around 800 such resources, some of
which contain thousands of elements.</p>
      <p>This work has gone a long way, with many success stories, in particular in high-value,
highly formalized domains, e.g., health and manufacturing. However, a general solution to
the semantic interoperability problem, applicable at sustainable cost and effort,
is yet to be found. The difficulties which arise are multifaceted. Some are related to the
fact that different resources only consider different partial aspects of a domain, or that
they represent it at different levels of abstraction and/or approximation. Last but not least,
no matter how it is built, any resource will hardly be reusable in novel
contexts, and it will most often need to be adapted and to evolve over time.</p>
      <p>
        These difficulties are deeply rooted in the nature of knowledge. People adapt their
representations of the world as a function of their goals, focus and many other factors [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
These local representations, though useful, are the key cause of semantic heterogeneity,
which is therefore unavoidable. It is simply impossible to construct a finite
representation capable of capturing the infinite richness of the world and also the infinite
ways, provided by language, to describe finitely (some aspect of) the world itself. Thus,
on one hand, for any chosen representation there will always be some aspect of the world
which is not captured and, on the other hand, there will always be an alternative way
to represent the same aspect of the world. This phenomenon is further complicated by
the fact that world diversity and representation diversity are independent, the first being
rooted in the world itself, and the second in how people think about it.
      </p>
      <p>
        In this paper, we propose a novel methodology, called iTelos,<sup>2</sup> for the management
of the diversity of knowledge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], as it arises from the combination of world and
representation diversity, independently of whether it comes from humans or from machines
(notice how semantic heterogeneity is just one form of knowledge diversity). iTelos is
based on the following three key intuitions: (i) it should support all phases of the
knowledge life cycle, end-to-end; in particular, iTelos considers, among others, both the
generation of a resource from scratch and its generation from existing resources (the latter being
the focus of the semantic interoperability problem); (ii) it should make explicit the
resources’ representation choices as well as their motivations; the current version of iTelos
considers two such motivations, namely a set of generalized questions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the set
of resources that are reused [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; (iii) it is crucially based, at the knowledge level, on the
notion of teleology [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where teleologies are knowledge resources constructed similarly
to ontologies, but by making explicit their underlying representation choices. The key
idea is to use this information as the basis for the (semi-)automatic low-effort reuse of
ontologies.
      </p>
      <p>Within this framework, our goal below is to introduce two new key elements of
iTelos, namely Knowledge Lotuses, as a general tool for representing and visualizing
knowledge diversity (Section 2) and an initial set of metrics which quantify the level of
diversity within and across teleologies (Section 3).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Representing Diversity</title>
      <p>
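        Let us assume that a teleology represents knowledge in terms of (types of) entities (e.g.,
Person, Place, Event), each being associated with a set of properties (e.g., birth-date,
height, near-to, father-of, has-capital), as is the case in, e.g., knowledge graphs or
relational models. Following Formal Concept Analysis (FCA) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we formalize teleologies
as contexts, where we define a context <inline-formula><tex-math>C</tex-math></inline-formula> as <inline-formula><tex-math>C = \langle E_C, P_C, I_C \rangle</tex-math></inline-formula>, with <inline-formula><tex-math>E_C = \{e_1, \dots, e_n\}</tex-math></inline-formula> being
the set of entities, <inline-formula><tex-math>P_C = \{p_1, \dots, p_n\}</tex-math></inline-formula> being the set of properties of <inline-formula><tex-math>C</tex-math></inline-formula>, and <inline-formula><tex-math>I_C</tex-math></inline-formula> being the incidence relation
<inline-formula><tex-math>I_C = \{(e, p) \in E_C \times P_C \mid p \text{ is a property of } e\}</tex-math></inline-formula>. <inline-formula><tex-math>I_C</tex-math></inline-formula> induces a Galois connection between sets of entities and sets of properties [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The set of properties
associated with an entity is called its intention, while we talk of an entity e being in the
domain of a property p, formally dom(p). Thus, for instance, the entity “Person” can be
in the domain of the properties “address” or “name”, while the property “address” may
occur with the entities “Person” or “Building”. Following the FCA notation, Table 1
reports a set of entities (left) with corresponding properties (top) from Schema.org.rdf
Version 3.5. The value boxes with crosses represent <inline-formula><tex-math>I_C</tex-math></inline-formula>.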
      </p>
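      <p>As a minimal sketch of the definition above, a context can be held in memory as a plain set of (entity, property) pairs. The toy incidence relation and the helper names prop and dom are illustrative assumptions and are not taken from Schema.org or from the iTelos implementation.</p>
      <preformat preformat-type="code">
```python
# Hypothetical toy context; entity and property names are illustrative only.
INCIDENCE = {
    ("Person", "name"),
    ("Person", "address"),
    ("Person", "birth-date"),
    ("Building", "address"),
    ("Building", "height"),
}


def prop(entity, incidence):
    """Return the set of properties associated with an entity."""
    return {p for (e, p) in incidence if e == entity}


def dom(property_name, incidence):
    """Return the set of entities in the domain of a property."""
    return {e for (e, p) in incidence if p == property_name}


print(sorted(prop("Person", INCIDENCE)))  # ['address', 'birth-date', 'name']
print(sorted(dom("address", INCIDENCE)))  # ['Building', 'Person']
```
      </preformat>
      <p>Storing only the pairs keeps the cross table implicit: a cross appears exactly where a pair is present in the set.</p>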
      <p>We represent the diversity which occurs within and across teleologies with
Knowledge Lotuses. Fig 1 provides three knowledge lotuses for (parts of) four
state-of-the-art knowledge schemas, namely OpenCyc, (the) DBpedia (ontology), Schema.org, and
SUMO.<sup>3,4</sup> Knowledge lotuses are Venn Diagrams<sup>5</sup> defined as follows. Let us assume
that we are interested in analyzing the diversity of a certain set of contexts C, e.g., the
four resources mentioned above. Knowledge lotuses model the three core elements of
contexts, namely, (i) the set of contexts themselves and, for each context, (ii) its set of
entities and, for each entity, (iii) its set of properties. The key intuition is to fix one of
these three elements (namely the context(s), or the entity(ies), or the property(ies)) and
then, under this assumption, to study the diversity of the second element against the third
element. Clearly, several combinations are possible and each of them provides a different
perspective on diversity. As an example, let us consider the three cases in Fig 1.</p>
      <p><sup>2</sup>From the Greek word telos, meaning “end, purpose”. The “i” stands for “integration”.</p>
      <p>Fig 1. Knowledge lotuses: (a) Schema.org; (b) Organization; (c) All properties.</p>
      <p>Lotus (a) fixes the context (Schema.org) and it represents the diversity of entities
in terms of their (un)shared properties. The dual case of comparing properties in terms
of the entities in their domain is also possible. These types of lotuses represent the
diversity internal to a context, in terms of their entities or their properties. Lotus (b) fixes
the entity (Organization) and it represents the diversity of teleologies in terms of their
(un)shared properties (for that entity). The dual case of comparing properties in terms of
the teleologies where they occur is also possible. These types of lotuses represent the
diversity across teleologies, for any given entity. Lotus (c) fixes the properties (all of them)
and it represents the diversity of teleologies in terms of the (un)shared entities. The dual
case of comparing entities in terms of the teleologies where they occur is also possible.
These types of lotuses represent the diversity across teleologies, for any given property.</p>
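      <p>The counts shown in the regions of a lotus can be obtained by recording, for every item, the set of labels it belongs to. The sketch below does this for an arbitrary number of sets; the function name lotus_regions and the sample property sets are hypothetical and do not reproduce the data behind Fig 1.</p>
      <preformat preformat-type="code">
```python
from collections import Counter


def lotus_regions(sets_by_label):
    """Count the items in each exclusive region of the Venn diagram.

    A region is keyed by the frozenset of labels whose sets contain the item.
    """
    membership = {}
    for label, items in sets_by_label.items():
        for item in items:
            membership.setdefault(item, set()).add(label)
    return Counter(frozenset(labels) for labels in membership.values())


# Hypothetical property sets for two entities of a single context.
example = {
    "Person": {"name", "address", "birth-date"},
    "Organization": {"name", "address", "founder"},
}
regions = lotus_regions(example)
print(regions[frozenset({"Person", "Organization"})])  # 2 shared properties
print(regions[frozenset({"Person"})])                  # 1 property only in Person
```
      </preformat>
      <p>The same function covers the three kinds of lotus: the labels can be entities of one context, teleologies for a fixed entity, or teleologies compared on their entities.</p>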
      <p>For instance, looking at Fig 1(a), in Schema.org, “Person” and “Organization” share
30 property terms, while they are distinguished by 31 and 23 terms respectively. Looking
at Fig 1(b), the four representations of “Organization” share only one property, while two
of them, i.e., OpenCyc and DBpedia, share 12 properties.</p>
      <p><sup>3</sup>www.cyc.com, wiki.dbpedia.org, www.schema.org, www.adampease.org.</p>
      <p><sup>4</sup>The data in Fig 1, as well as the quantitative analysis below, have been generated via a simple NLP pipeline
which performs the following main steps: a) split a string every time a capital letter is encountered (e.g.,
birthDate → birth and date); b) lower-case all characters; c) filter out stop-words (e.g., hasAuthor → author).</p>
      <p><sup>5</sup>All the lotuses in Fig 1 represent four sets. Simpler or more complex lotuses can be depicted to represent the
diversity of lower or higher numbers of resources.</p>
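      <p>The three pre-processing steps listed in footnote 4 can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline actually used: the stop-word list is invented for the example, since the source does not say which list was applied.</p>
      <preformat preformat-type="code">
```python
import re

# Assumed stop-word list; the source does not say which list was used.
STOP_WORDS = {"has", "is", "of", "the", "a", "an"}


def normalize_property(name, stop_words=STOP_WORDS):
    """Split a name at every capital letter, lower-case it, drop stop-words."""
    # a) insert a space before each capital letter, then split on whitespace
    tokens = re.sub(r"([A-Z])", r" \1", name).split()
    # b) lower-case all characters
    lowered = [token.lower() for token in tokens]
    # c) filter out stop-words
    return [token for token in lowered if token not in stop_words]


print(normalize_property("birthDate"))  # ['birth', 'date']
print(normalize_property("hasAuthor"))  # ['author']
```
      </preformat>
      <p>Because the split happens at every capital letter, an acronym written in capitals is broken into single letters; handling such names would require an extra rule that the footnote does not describe.</p>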
      <p>
        One observation. Consider Fig 1(c). If one looks at the central part, there are only
four entities which are shared by all resources. These entities are Event, Place, Person
and Organization, namely the entities for time and space, i.e., the two a priori of
perception [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and the two arguably most common types of agenthood. Dually, if one looks at
the fringes, most entities are defined in only one resource (e.g., 2085 in OpenCyc), which
is motivated by the different focus of each resource. Thus, for instance, Schema.org is more focused
on information objects, while DBpedia contains information about biological species.
Despite the fact that these four resources are arguably general purpose and that,
therefore, they somehow take a similar view of the world, they present a very low level of
unity (in the shared part) together with a high level of diversity (in the unshared parts).
This is further evidence that there is no such thing as an observer-independent
representation of the world. Analogous arguments, all providing motivations
for a quantitative study of diversity, could also be given for the other lotuses and for other
knowledge resources.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Quantifying Diversity</title>
      <p>
        We analyze first the diversity within a resource and then across multiple resources. Our
key insight for analyzing the internal diversity of a resource is based upon Rosch’s cue validity
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This notion was used to define the set of basic level categories, namely those
categories which maximize the number of characteristics shared by their members and
minimize the number of characteristics shared with the members of their sibling categories.
Following Rosch, we define the cue validity of a property p w.r.t. an entity e, also called
cuep-validity, as:
      </p>
      <disp-formula id="eq-1"><label>(1)</label><tex-math>\mathrm{Cue}_{p}(p, e) = \frac{\mathrm{PoE}(p, e)}{|\mathrm{dom}(p)|}</tex-math></disp-formula>
      <p>with |X| being the cardinality of the set X and PoE(p, e) being defined as:</p>
      <disp-formula id="eq-2"><label>(2)</label><tex-math>\mathrm{PoE}(p, e) = 1 \text{ if } e \in \mathrm{dom}(p), \quad \mathrm{PoE}(p, e) = 0 \text{ otherwise}</tex-math></disp-formula>
      <p>Cuep(p, e) returns 0 if p is not associated with e and 1/n otherwise, where n is the number of entities in the
domain of p. In particular, if p is associated with only one entity, its cuep-validity
is maximum and equal to one. Given the notion of cuep-validity, we define the
cue validity of an entity, also called cuee-validity, as the sum of the cue validities of the
properties associated with the entity, namely:</p>
      <disp-formula id="eq-3"><label>(3)</label><tex-math>\mathrm{Cue}_{e}(e) = \sum_{i=1}^{|\mathrm{prop}(e)|} \mathrm{Cue}_{p}(p_i, e) = c \in [0, |\mathrm{prop}(e)|]</tex-math></disp-formula>
      <p>where prop(e) is the set of properties which are associated with e. The intuition is the
same as Rosch’s: the entities with higher cuee-validity will be the easiest to recognize.
In fact, the cuee-validity increases with the number of properties, while decreasing,
for each property, with the number of entities which share that property.</p>
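      <p>The two cue validities defined above translate almost literally into code. The sketch below uses exact fractions so that values such as 1/n are not rounded; the toy context and the function names are assumptions made for illustration only.</p>
      <preformat preformat-type="code">
```python
from fractions import Fraction


def dom(p, incidence):
    """Entities in the domain of property p."""
    return {e for (e, q) in incidence if q == p}


def prop(e, incidence):
    """Properties associated with entity e."""
    return {p for (f, p) in incidence if f == e}


def cue_p(p, e, incidence):
    """Cue validity of property p w.r.t. entity e: PoE(p, e) / |dom(p)|."""
    entities = dom(p, incidence)
    if e not in entities:
        return Fraction(0)  # PoE(p, e) = 0
    return Fraction(1, len(entities))  # PoE(p, e) = 1


def cue_e(e, incidence):
    """Cue validity of entity e: sum of cue_p over the properties of e."""
    return sum((cue_p(p, e, incidence) for p in prop(e, incidence)), Fraction(0))


# Hypothetical toy context, not taken from any of the resources analyzed.
incidence = {
    ("Person", "name"), ("Person", "address"), ("Person", "birth-date"),
    ("Building", "address"), ("Building", "height"),
}
print(cue_p("address", "Person", incidence))  # 1/2
print(cue_e("Person", incidence))             # 5/2
```
      </preformat>
      <p>In the toy context “Person” scores 1 + 1/2 + 1 = 5/2: two properties are exclusive to it, while “address” is shared with “Building” and therefore contributes only one half.</p>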
      <p>However, the cuee-validity does not tell us anything about how many properties an
entity shares with other entities in the same context. As an example, assume that
we have two entities and two properties, and consider the following two situations: (i) both
entities share both properties and (ii) each entity is associated with exactly one property, different from that of the other entity.
In both cases the cuee-validity of the two entities is 1 but, while in the first case they
are indistinguishable, in the second case they are highly identifiable. We capture this
distinction via the notion of cueer-validity, defined as:</p>
      <disp-formula id="eq-4"><label>(4)</label><tex-math>\mathrm{Cue}_{er}(e) = \frac{\mathrm{Cue}_{e}(e)}{|\mathrm{prop}(e)|}</tex-math></disp-formula>
      <p>The higher the cueer-validity, the more distinguishable an entity is, for a given value of the
cuee-validity. Thus, in the first case of the example above, the cueer-validity
of both entities is 0.5, while in the second case it is one.</p>
      <p>As an example,
let us analyze the internal diversity in SUMO and Schema.org as from Fig 2, where the
x-axis and y-axis are cueer and cuee, respectively. A few observations are in order. The
first is that both resources have a few outlier entities (e.g., “GeopoliticalArea”, “Person”
and “CreativeWork”) with higher values of cuee-validity. Furthermore, there seems to be
a pattern by which, the more the cuee-validity decreases, the more entities there are with
the same cueer-validity, which means that there is an area with a cloud of entities
which are not easy to distinguish from one another. Furthermore, the outlier entities are
the only ones with high values of cuee-validity. Looking into the specifics, it
seems that the outliers are mostly those entities of higher interest (e.g., “CreativeWork”
and “Person” in Schema.org) as, maybe, was to be expected (in anything we do, we all
tend to focus on what is of highest interest). This confirms once more the role of diversity
in capturing not only what is formalized (as in knowledge lotuses) but also the
quality of the formalization (via metrics).</p>
      <p>
        We define the basic diversity measure across
resources via the Jaccard index [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Let <inline-formula><tex-math>C_A</tex-math></inline-formula> and <inline-formula><tex-math>C_B</tex-math></inline-formula> be two contexts, with their sets of
properties, <inline-formula><tex-math>\mathrm{prop}(C_A)</tex-math></inline-formula> and <inline-formula><tex-math>\mathrm{prop}(C_B)</tex-math></inline-formula>. Then, we define the similarity of the two contexts as
follows:
      </p>
      <disp-formula id="eq-5"><label>(5)</label><tex-math>\mathrm{Sim}_{c}(C_A, C_B) = \frac{|\mathrm{prop}(C_A) \cap \mathrm{prop}(C_B)|}{|\mathrm{prop}(C_A) \cup \mathrm{prop}(C_B)|} = c \in [0, 1]</tex-math></disp-formula>
      <p><inline-formula><tex-math>\mathrm{Sim}_{c}(C_A, C_B)</tex-math></inline-formula> is a symmetric measure which tells us how much is (not) shared across two
resources. If this measure is 1 then the two resources coincide; if it is 0 then they are
disjoint. As an example of use, take each of the two contexts to be a single entity, for instance as
formalized in two different resources. Clearly the value of this measure is independent of
the actual name of the entity itself. This measure thus allows us, for instance, to realize
that the two input entities are the same despite the fact that they have different names,
and vice versa. Fig 3 shows some examples of similarity of entities from
SUMO and Schema.org. On the y-axis we have the value of <inline-formula><tex-math>\mathrm{Sim}_{c}(C_A, C_B)</tex-math></inline-formula>, with <inline-formula><tex-math>C_A</tex-math></inline-formula> being
the entity written above the graph as formalized in either SUMO (SU) or
Schema.org (SC), while on the x-axis we have other entities from both resources (each
entity being taken as <inline-formula><tex-math>C_B</tex-math></inline-formula>), in decreasing order of similarity. For instance, “Apartment”
in SC is essentially a synonym of “SingleFamilyResidence” in SC, with a similar
situation for “MoveAction”. Notice how the two entities in SC which are synonyms of
“MoveAction” are, by transitivity, also synonyms of each other (where it is not clear whether this was
actually what the modeler wanted). The other diagrams report much lower
levels of similarity, where for instance “Person” in SC has the highest similarity with
“Organization” as in SC or SU, namely the other entity for agenthood.</p>
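      <p>A minimal sketch of the Jaccard-based similarity and of the ranking plotted on the x-axis of Fig 3 is given below. The property sets are hypothetical and were chosen only to show one perfect match and one weak match; treating two empty property sets as identical is an assumption, since the source does not discuss that case.</p>
      <preformat preformat-type="code">
```python
def sim_c(props_a, props_b):
    """Jaccard index of two property sets, a value in [0, 1]."""
    union = props_a.union(props_b)
    if not union:
        # Assumption: two empty property sets are treated as identical.
        return 1.0
    return len(props_a.intersection(props_b)) / len(union)


def rank_by_similarity(target, candidates):
    """Sort candidate names by decreasing similarity to the target set."""
    scores = {name: sim_c(target, props) for name, props in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical property sets; not the actual SUMO or Schema.org data.
apartment = {"address", "rooms", "floor"}
candidates = {
    "Person": {"address", "name"},
    "SingleFamilyResidence": {"address", "rooms", "floor"},
}
print(rank_by_similarity(apartment, candidates))
# [('SingleFamilyResidence', 1.0), ('Person', 0.25)]
```
      </preformat>
      <p>Entity names never enter the computation, which is why two differently named entities with the same properties obtain similarity 1.</p>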
      <p>Fig 3. Similarity of entities from SUMO and Schema.org: (a) Apartment; (b) MoveAction; (c) Person.</p>
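      <p>Returning to the internal metrics, the two-entity example used earlier in this section to motivate the cueer-validity can be checked numerically. The sketch below is self-contained and uses placeholder names (e1, e2, p1, p2) for the two entities and two properties of that example.</p>
      <preformat preformat-type="code">
```python
from fractions import Fraction


def cue_er(e, incidence):
    """Cuee-validity of e divided by the number of properties of e."""
    properties = {p for (f, p) in incidence if f == e}
    if not properties:
        raise ValueError("entity has no properties")
    # Each property of e contributes 1 / |dom(p)| to the cuee-validity.
    cue_e = sum(
        (Fraction(1, len({f for (f, q) in incidence if q == p})) for p in properties),
        Fraction(0),
    )
    return cue_e / len(properties)


# Case (i): both entities share both properties.
shared = {("e1", "p1"), ("e1", "p2"), ("e2", "p1"), ("e2", "p2")}
# Case (ii): each entity has its own single property.
disjoint = {("e1", "p1"), ("e2", "p2")}

print(cue_er("e1", shared))    # 1/2
print(cue_er("e1", disjoint))  # 1
```
      </preformat>
      <p>Both cases give a cuee-validity of 1 for each entity, so only the normalization by the number of properties separates the indistinguishable pair from the fully identifiable one.</p>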
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>We see this as just the beginning of a quantitative study of knowledge diversity. We see
potential from a scientific point of view, towards the study of knowledge as an emerging
natural phenomenon, but also from an engineering point of view, towards a widespread
reuse of existing resources (teleologies). This work is part of a long-term effort aimed
at providing iTelos, a general methodology for managing knowledge diversity and, in
particular, for performing knowledge and data integration in a cost-effective way.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Giunchiglia</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Fumagalli</surname>
          </string-name>
          .
          <article-title>Concepts as (recognition) abilities</article-title>
          . In FOIS,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Giunchiglia</surname>
          </string-name>
          .
          <article-title>Managing diversity in knowledge</article-title>
          .
          <source>In IEA/AIE, page 1</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>U.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giunchiglia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Madalli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Maltese</surname>
          </string-name>
          .
          <article-title>Modeling recipes for online search</article-title>
          .
          <source>In OTM On the Move to Meaningful Internet Systems</source>
          , pages
          <fpage>625</fpage>
          -
          <lpage>642</lpage>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Giunchiglia</surname>
          </string-name>
          .
          <article-title>Geoetypes: Harmonizing diversity in geospatial data (short paper)</article-title>
          .
          <source>In OTM On the Move to Meaningful Internet Systems</source>
          , pages
          <fpage>643</fpage>
          -
          <lpage>653</lpage>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Giunchiglia</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Fumagalli</surname>
          </string-name>
          .
          <article-title>Teleologies: Objects, actions and functions</article-title>
          .
          <source>In ER 2017</source>
          , pages
          <fpage>520</fpage>
          -
          <lpage>534</lpage>
          . Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ganter</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Wille</surname>
          </string-name>
          .
          <source>Formal concept analysis: mathematical foundations</source>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Immanuel</given-names>
            <surname>Kant</surname>
          </string-name>
          .
          <source>Critique of pure reason</source>
          (translated and edited by P. Guyer &amp; A. W. Wood)
          .
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rosch</surname>
          </string-name>
          .
          <article-title>Principles of categorization</article-title>
          .
          <source>In Concepts: core readings</source>
          , page
          <fpage>189</fpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Real</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Vargas</surname>
          </string-name>
          .
          <article-title>The probabilistic basis of Jaccard's index of similarity</article-title>
          .
          <source>Systematic biology</source>
          ,
          <volume>45</volume>
          (
          <issue>3</issue>
          ):
          <fpage>380</fpage>
          -
          <lpage>385</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>