=Paper= {{Paper |id=None |storemode=property |title=Capturing Emerging Relations between Schema Ontologies on the Web of Data |pdfUrl=https://ceur-ws.org/Vol-665/NikolovEtAl_COLD2010.pdf |volume=Vol-665 |dblpUrl=https://dblp.org/rec/conf/semweb/NikolovM10 }} ==Capturing Emerging Relations between Schema Ontologies on the Web of Data== https://ceur-ws.org/Vol-665/NikolovEtAl_COLD2010.pdf
 Capturing Emerging Relations between Schema
        Ontologies on the Web of Data

                          Andriy Nikolov, Enrico Motta

       Knowledge Media Institute, The Open University, Milton Keynes, UK
                       {a.nikolov, e.motta}@open.ac.uk



      Abstract. Semantic heterogeneity caused by the use of different on-
      tologies to describe the same topics represents an obstacle for many
      data integration tasks on the Web of Data, in particular, discovering
      relevant repositories for interlinking and comparing repositories with re-
      spect to the coverage of specific domains. To facilitate these tasks, map-
      pings between schema terms are needed alongside the links between in-
      stances. Currently, explicitly specified schema-level mappings are scarce
      in comparison with instance-level links. However, by analysing existing
      instance-level links it is possible to capture correspondences between
      classes to which these instances belong. In our experiments, we applied
      this approach on a large scale to generate schema-level mappings be-
      tween several Linked Data repositories. The results of these experiments
      provide some interesting insights about the use of ontologies on the Web
      of Data and schema-level relations which emerge from existing data-level
      interlinks.


1   Introduction
One of the main motivations behind large-scale data publishing using the Linked
Data approach [1] is the possibility to integrate relevant information originally
published by different providers. This is achieved, in particular, by establishing
links between instances in different repositories. However, linking a new reposi-
tory to other datasets in the cloud remains a non-trivial task for a data publisher.
In order to tackle this task, several questions have to be answered, in particular:
 – Which other repositories contain relevant data?
 – Which of these repositories should a new repository be connected to? (or,
   alternatively, URIs from which repositories should be reused in a new
   repository?)
In order to answer the first question, one needs to know what types of individuals
are stored in the datasets. With respect to the latter question, the choice of a
candidate third-party repository for establishing links depends on several factors,
in particular, their coverage (are all instances from the new repository mentioned
in a candidate repository?), popularity (which one is the commonly accepted
reference source for a specific type of data?), and level of detail (which repository
describes the most properties for instances of a particular class?).
    These questions can partially be answered with the help of meta-level descrip-
tions using the voiD ontology1 . However, voiD descriptors may be insufficient
to compare some of the characteristics (e.g., whether the domain of food and
diets is better covered in Freebase or DBPedia). Moreover, voiD descriptors do
not always describe all relevant properties of datasets (e.g., dcterms:subject is
not always provided) and may not be available for some datasets.
    One of the major obstacles which complicate this kind of analysis is schema
heterogeneity. It can be difficult to establish automatically that two repositories
describe the same kind of data, retrieve relevant data subsets from them, and
make a comparison, if these repositories use different terminology to describe
the same or semantically similar types of instances. For example, a hypothet-
ical repository describing a TV program may need to refer to descriptions of
movies, music pieces, and their performers. There are several repositories avail-
able on the Web: e.g., specific sources describing the music topic (MusicBrainz,
Jamendo, etc.), the movie topic (LinkedMDB), as well as generic sources cov-
ering both (DBPedia, Freebase). In order to compare how well these reposito-
ries are suitable as reference sources, it is useful to know which classes in the
respective ontologies contain overlapping data: e.g., music:MusicArtist and db-
pedia:MusicalArtist, linkedmdb:film and dbpedia:Movie, etc. Having a high-level
overview of schema-level correspondences, which would show the coverage of
topics by available ontologies, would help the data publisher to make appropriate
choices.
    In this paper, we describe our work on constructing such a network of
class-level mappings for a subset of the Linked Data cloud. So far, several ontologies
used by popular Linked Data repositories were enriched with mappings con-
necting them to other ontologies (most notably, in the context of the UMBEL
project2 ). However, these mappings, constructed in a top-down way, only cover
a limited subset of the Web of Data and do not fully reflect the structure of the
repository network formed by instance-level links (e.g., such important reposi-
tories as Freebase, RKBExplorer, and LinkedMDB are not covered). Given the
abundance of existing instance-level links, a bottom-up process where the cor-
respondences between classes are captured based on the links between sets of
their instances becomes a promising approach. We applied light-weight instance-
based ontology matching techniques to a snapshot of the Web of Data which
was proposed for the Billion Triple Challenge 2009 competition3 and extracted
a large-scale network of ontology mappings. This network provides interesting
insights into the use of ontologies on the Web of Data and can be employed to
facilitate data integration.
    The rest of the paper is organised as follows: in section 2 we briefly outline
the ontology matching process we used to extract the mappings and discuss
our observations about its applicability and limitations. Then, in section 3 we
describe the resulting network of schema mappings we obtained. In section 4 we
give an overview of relevant existing work. Finally, section 5 discusses the
limitations of our work and directions for future work.

1 http://semanticweb.org/wiki/VoiD
2 http://www.umbel.org
3 http://vmlion25.deri.ie/


2     Constructing the schema network

The snapshot of the Web of Data which we used in our work was proposed
for the Billion Triple Challenge 2009 competition4 . This is a large-scale dataset
containing about 1.14 billion statements. It contains the core portion of the
repositories published within the Linking Open Data (LOD) initiative, as well
as many smaller datasets retrieved using Semantic Web search engines, such
as Watson and Falcon-S. The LOD datasets included in the BTC repository,
such as DBPedia, Freebase, Bio2RDF, RKBExplorer, Geonames, and others, still
constitute the core of the Web of Data cloud and are commonly used to connect
other datasets. Thus, their schema ontologies are particularly interesting for
potential data integration scenarios.
    To derive the sets of mappings between these ontologies, we applied a light-
weight matching technique which computes the similarity between a pair of
classes based on the degree of overlap between their instance sets. Originally, we
used this approach to produce schema-level mappings in order to facilitate fur-
ther instance coreference resolution and discover previously missing links [2]. An
advantage of using instance-based ontology matching techniques in the Linked
Data environment lies in their ability to capture interconnections between on-
tologies which emerged from the way they are used by actual repositories rather
than the way they were originally designed.
    When two classes share at least one individual, we say that there is an overlap
relation between these classes. There are two common cases where an individual
becomes assigned to several classes defined in different ontologies:

 – Declared coreference association. In this case, two individuals belonging to
   different repositories are declared to be identical and linked via the owl:sameAs
   property. This creates an overlap relation between the classes to which the
   instances belong.
 – Co-typing. In this case the publishers of a repository structure the data
   using terms of several ontologies. In this way, one individual can be explicitly
   assigned to several classes from different ontologies. One example is DBPedia,
   which uses Yago and UMBEL ontologies in addition to its native DBPedia
   ontology.

These two types of overlaps illustrate different aspects of the data structure.
Declared association-based overlap relations characterise the distribution of data
in different repositories and correspondences between sets of their individuals.
Co-typing-based mappings mostly highlight the choices of data publishers to
use specific vocabularies to annotate their data. To keep this distinction, in
this paper we analyse the declared association-based and co-typing-based overlap
mappings separately.

4 Dataset statistics can be found on http://vmlion25.deri.ie/ and
  http://gromgull.net/blog/category/semantic-web/billion-triple-challenge/.

    In order to generate all overlap relations present in the dataset, we used the
following procedure:
 1. Extract all rdf:type relations present in the dataset: A(I), where A is a class
    and I is an instance of this class.
 2. For each class A, generate the set of its instances (extension): e(A) =
    {I|A(I)}.
 3. For each pair of classes A and B, generate the co-typing-based overlap set:
    ec(A ∩ B) = {I | A(I), B(I)}. In total, this constituted about 3.6M
    co-typing-based overlap mappings (we only considered intersections between
    classes which did not share the same URI namespace).
 4. Extract all owl:sameAs relations present in the dataset (sameAs(I1, I2)) and
    generate their transitive closure.
 5. Generate association-based overlap sets: ea(A ∩ B) = {I1 | A(I1), B(I2),
    sameAs(I1, I2)} (one sameAs relation corresponds to one element in the
    set). In total, about 1M (992482) association-based overlap mappings were
    produced.
For association-based overlap sets we distinguished between a direct class link
(when their individuals were explicitly stated in the dataset as identical) and
an indirect link (when owl:sameAs relations were inferred using transitivity).
Indirect mappings occurred, in particular, when two repositories were connected
via a third one (e.g., MusicBrainz and Freebase via DBPedia). Both sets of
mappings were filtered to remove general-purpose concepts (such as OWL and
RDFS terms) and blank nodes. These two sets of mappings constitute the “raw
data” which were later analysed to retrieve valid semantic mappings.
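
The five steps above, together with the direct/indirect distinction, can be sketched in Python as follows. This is only a minimal in-memory illustration, not the implementation used for the BTC dataset: the namespace rule (cut at the last '#' or '/'), the function names, and the example.org-style URIs are all assumptions made for the example, and the closure is computed with a simple union-find structure.

```python
from collections import defaultdict
from itertools import combinations


def namespace(uri):
    """Assumed namespace rule: everything up to the last '#' or '/'."""
    return uri[: max(uri.rfind("#"), uri.rfind("/")) + 1]


def extensions(type_pairs):
    """Steps 1-2: build e(A) from (instance, class) pairs."""
    ext = defaultdict(set)
    for instance, cls in type_pairs:
        ext[cls].add(instance)
    return ext


def cotyping_overlaps(ext):
    """Step 3: ec(A ∩ B) for classes from different namespaces."""
    result = {}
    for a, b in combinations(sorted(ext), 2):
        shared = ext[a] & ext[b]
        if shared and namespace(a) != namespace(b):
            result[(a, b)] = shared
    return result


def sameas_groups(sameas_pairs):
    """Step 4: transitive closure of owl:sameAs as equivalence groups."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i1, i2 in sameas_pairs:
        root1, root2 = find(i1), find(i2)
        parent[root1] = root2
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())


def association_overlaps(ext, sameas_pairs):
    """Step 5: ea(A ∩ B), plus a flag saying whether any link was stated directly."""
    types = defaultdict(set)
    for cls, instances in ext.items():
        for instance in instances:
            types[instance].add(cls)
    stated = {frozenset(pair) for pair in sameas_pairs}
    overlaps, direct = defaultdict(set), defaultdict(bool)
    for group in sameas_groups(sameas_pairs):
        for i1, i2 in combinations(sorted(group), 2):
            for x, y in ((i1, i2), (i2, i1)):
                for a in types.get(x, ()):
                    for b in types.get(y, ()):
                        if a < b:  # store each class pair once
                            overlaps[(a, b)].add((x, y))
                            direct[(a, b)] |= frozenset((x, y)) in stated
    return overlaps, direct


# Hypothetical toy data: three repositories linked through a central one.
DB, YG = "http://db.example/onto/MusicalArtist", "http://yg.example/class/Singer"
MO, FB = "http://mo.example/ns#MusicArtist", "http://fb.example/type/music.artist"
TYPES = [
    ("http://db.example/res/Madonna", DB),
    ("http://db.example/res/Madonna", YG),
    ("http://mb.example/artist/1", MO),
    ("http://fb.example/m/01", FB),
]
SAMEAS = [
    ("http://mb.example/artist/1", "http://db.example/res/Madonna"),
    ("http://db.example/res/Madonna", "http://fb.example/m/01"),
]
ext = extensions(TYPES)
cotyped = cotyping_overlaps(ext)
linked, direct = association_overlaps(ext, SAMEAS)
```

In this toy setting the pair (FB, MO) is only reachable through the central repository, so it is flagged as indirect, while the pairs involving DB are direct.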
    In our original work [2], we used set similarity-based metrics to discover
relations between “strongly overlapping” classes in the ontologies. We used a
fuzzy notion of “strong overlap” instead of strict subsumption or equivalence
for two main reasons. First, in the Linked Data environment such mappings in
many cases are impossible to derive: sometimes even strong semantic similarity
between concepts does not imply strict equivalence. For instance, the concept
dbpedia:Actor denotes professional actors (both cinema and stage), while the
concept movie:actor in LinkedMDB refers to any person who played a role in
a movie, including participants in documentaries, but excluding stage actors.
Second, such “strong overlap” relations are valuable because they often point to
semantically similar categories which to a large extent share the same instances.
While not always strictly logically correct, these relations are still valuable for
the goals we discussed in section 1: determining and comparing suitable sources
for linking.
    In order to capture the optimal parameters for distinguishing valid semantic
mappings, in the experiments described in this paper we employed a machine
learning approach. To construct a gold standard set, we randomly selected
a set of 6000 mappings (3000 association-based and 3000 co-typing-based ones)
and annotated them manually (“strong overlap” relations were assigned based on
subjective judgement). In these initial experiments, annotation was done by one
person. After that, we used this gold standard set to train a classification model
which would assign the relation type to any pair of overlapping classes. Our
goal was to find a suitable classifier to distinguish between valid subsumption
and equivalence mappings (owl:equivalentClass and rdfs:subClassOf ) and other
mappings.
   For the classifier, we included the following features:
 – ns1, ns2 : namespaces of two class URIs A and B respectively.
 – |e(A ∩ B)|: the size of the set of instances belonging to both classes A and
   B.
 – |e(A)|, |e(B)|: sizes of instance sets for classes A and B respectively.
 – λ(A, B), λ(B, A), where λ(X, Y) = |e(X ∩ Y)| / |e(X)|.

 – direct (only for declared association-based links): a boolean value equal to
   true for direct declared association-based mappings and false otherwise.
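
As a small worked illustration of the numeric features, the sketch below computes the overlap size, the two extension sizes, and both λ values for a pair of instance sets. The function name, the dictionary keys, and the toy sets are hypothetical; the namespace and direct features are omitted because they are simply copied from the data.

```python
def overlap_features(ext_a, ext_b):
    """Numeric classifier features for a pair of class extensions e(A), e(B)."""
    overlap = len(ext_a & ext_b)
    return {
        "overlap": overlap,                  # |e(A ∩ B)|
        "size_a": len(ext_a),                # |e(A)|
        "size_b": len(ext_b),                # |e(B)|
        "lambda_ab": overlap / len(ext_a),   # share of A's instances also in B
        "lambda_ba": overlap / len(ext_b),   # share of B's instances also in A
    }


# Toy extensions: every instance of A is also an instance of B.
features = overlap_features(set(range(4)), set(range(10)))
```

Here λ(A, B) = 1.0 and λ(B, A) = 0.4, the kind of asymmetric pattern one would expect when A behaves like a subclass of B; the actual decision boundaries were learned by the classifier rather than fixed by hand.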
To test the resulting model, we used the standard 10-fold cross-validation mech-
anism. After testing, we found that the J48 decision tree algorithm was able
to achieve the best performance (Table 1), so this learned classifier was then
applied to the whole dataset.


                       Table 1. Test results: class matching


              Mapping set        Test  Algorithm  Precision  Recall  F1
              Association-based  1     J48        0.939      0.689   0.795
              Co-typing-based    2     J48        0.952      0.944   0.948
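
Assuming F1 in Table 1 is the usual harmonic mean of precision and recall, the reported values can be re-derived from the other two columns with a few lines of Python (the function name is ours):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


association_f1 = round(f1_score(0.939, 0.689), 3)
cotyping_f1 = round(f1_score(0.952, 0.944), 3)
```

Both rounded values match the F1 column of the table.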



    The resulting set of mappings was compared against the set of already ex-
isting schema-level relations declared in the dataset. We discovered that the
majority of overlap mappings were not covered by explicitly defined axioms.
Only 3119 mappings (2162 and 957 for the declared association-based and co-
typing-based subsets respectively) were found to be defined as rdfs:subClassOf
and owl:equivalentClass (or could be inferred), which constituted less than 2.6%
and 1.4% of the number of mappings selected by the learned classifier in each
case.


3   Analysing the resulting mappings
We applied the learned decision tree models (J48) to our two sets of mappings
containing declared association-based and co-typing-based overlap mappings. At
the next step, we filtered out redundant mappings: when a class A is found to be a
subclass of two classes B and Bsuper where B ⊑ Bsuper and the distance metrics
are equal (λ(A, B) = λ(A, Bsuper)), then only the mapping A ⊑ B remains, and
the mapping A ⊑ Bsuper is removed. Two resulting sets of mappings were then
used to construct networks connecting classes from different ontologies. The
characteristics of this network are discussed in section 3.1. Then, in order to
study the relations between whole vocabularies, we used the original mappings
between classes to generate a set of mapping-based links between ontologies.
This stage is described in section 3.2.
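
The redundancy filter described above can be sketched as follows. This is a minimal quadratic-time illustration under our own assumptions: subclass mappings are given as a dictionary from (A, B) to λ(A, B), the superclass relation between target classes is given as a precomputed, acyclic lookup table, and all class names are hypothetical.

```python
def filter_redundant(subclass_lambda, superclasses):
    """Drop A ⊑ Bsuper when A ⊑ B exists, B ⊑ Bsuper, and both λ values are equal.

    subclass_lambda: {(A, B): λ(A, B)} for candidate mappings A ⊑ B.
    superclasses:    {B: set of (transitive) superclasses of B}, assumed acyclic.
    """
    kept = {}
    for (a, b_super), lam in subclass_lambda.items():
        redundant = any(
            a2 == a
            and b != b_super
            and b_super in superclasses.get(b, set())
            and lam_b == lam
            for (a2, b), lam_b in subclass_lambda.items()
        )
        if not redundant:
            kept[(a, b_super)] = lam
    return kept


# Hypothetical example: b:Film ⊑ b:Work, and a:film maps to both with equal λ.
mappings = {
    ("a:film", "b:Film"): 0.9,
    ("a:film", "b:Work"): 0.9,
    ("a:film", "c:Movie"): 0.8,
}
kept = filter_redundant(mappings, {"b:Film": {"b:Work"}})
```

Only the more specific mapping to b:Film survives, together with the unrelated mapping to c:Movie.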

3.1   Links between classes
We obtained two graphs where classes played the role of nodes and mappings
represented edges. The properties of these resulting networks of classes are given
in Table 2. To give an overview of the most important “hub” nodes in the
network, Table 3 lists the top 10 classes ranked by the number of connections
they are involved in.

                           Table 2. Networks of classes

       Property                                     Declared association-based  Co-typing-based
       Number of nodes                              20365                       35578
       Number of edges                              82422                       67620
       Maximum number of connections per node       5301                        18137
       Node with the maximum number of connections  geonames:Feature            foaf:Person
       Average number of connections per node       8.09                        3.80

    We can see that the “hub” nodes are classes which represent popular
concepts and are defined at a high level in the class hierarchy. The large number
of mappings per class is mostly caused by many rdfs:subClassOf relations. After
analysing the distribution of mappings per class, we found that in both cases it
follows a power law and most classes had only one mapping to another class.
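
The per-class connection counts and the network averages can be illustrated with a short Python sketch. The edge list below is a toy example rather than data from our mapping sets, and the function names are ours; the average is computed as 2E/N, a reading with which the averages reported for the two class networks agree.

```python
from collections import Counter


def degrees(edges):
    """Number of mappings (edges) each class takes part in."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def average_degree(num_nodes, num_edges):
    """Each edge contributes to the degree of two nodes."""
    return 2 * num_edges / num_nodes


# Toy class network with one obvious hub.
toy_edges = [
    ("foaf:Person", "umbel:Person"),
    ("foaf:Person", "dbpedia:Person"),
    ("foaf:Person", "wordnet:Person"),
    ("dbpedia:Person", "umbel:Person"),
]
hub, hub_degree = degrees(toy_edges).most_common(1)[0]

# Node and edge counts of the two class networks.
association_avg = round(average_degree(20365, 82422), 2)
cotyping_avg = round(average_degree(35578, 67620), 2)
```

Ranking the counter with most_common(10) yields a top-10 list of hub classes.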
    The declared association-based network derived from owl:sameAs links
between instances is more connected: the average number of mappings per class is
8.09 compared to 3.80 in the co-typing-based case, despite the fact that it
contains fewer nodes. This is possibly caused by the “data-level focus” of the LOD
initiative: the priority for a data repository owner is to generate instance-level
links to other repositories rather than reuse several different vocabularies for
data description. In this case, class-level mappings automatically derived from
owl:sameAs links can be particularly helpful for data integration tasks, because
they add new information which was not explicitly stated in any one repository.
On the other hand, the co-typing-based network illustrates the impact of ontol-
ogy popularity: although the graph has more nodes, it is less connected, and a
single class foaf:Person contributes to more than 25% of all mappings. From the
results we obtained, we can see the strong influence of DBPedia on the
resulting mappings. In the association-based set, 7 out of the top 10 nodes relate to
top-level entities from the YAGO ontology. High positions of geonames:Feature
and freebase:people.person are also largely due to the number of DBPedia and
YAGO classes modelling the respective topics. In the co-typing-based network,
we can see the strong presence of the FOAF and WordNet ontologies (largely
due to their high reuse in small-scale datasets even before the start of the LOD
initiative). Beyond that, all top nodes in the network were produced based on
DBPedia instances annotated using different schemas. It is interesting to see the
high position of the class dbpedia:FootballPlayer. The main reason for it is the
large number of YAGO classes (Wikipedia categories) describing this topic.

                  Table 3. Top 10 classes (by number of edges)

      Rank  Declared association-based           Co-typing-based
            Name                          Edges  Name                    Edges
      1     geonames:Feature              5301   foaf:Person             18137
      2     freebase:people.person        2318   umbel:Person            4533
      3     yago:PhysicalEntity100001930  2230   dbpedia:Person          2478
      4     yago:Object100002684          2076   foaf:OnlineAccount      1983
      5     yago:Abstraction100002137     1759   dbpedia:FootballPlayer  1300
      6     yago:Whole100003553           1511   wordnet:Person          1237
      7     linkedmdb:film                1085   dbpedia:Album           996
      8     yago:LivingThing100004258     975    dbpedia:Species         920
      9     yago:Organism100004475        974    dbpedia:Artist          900
      10    yago:CausalAgent100007347     956    dbpedia:MusicalArtist   853

    When we merged the two mapping sets into one, we found that only a small
subset of mappings (3591) was shared between the two networks. The two types of
evidence we used produced complementary sets of mappings rather than duplicating
each other.

3.2   Mapping-based links between ontologies
In order to capture the relations between different vocabularies used on the
Web of Data, we generated a set of mapping-based links between ontologies.
In accordance with [3], we say that there is a mapping-based link between two
ontologies O1 and O2 if there exists a mapping between classes A and B such
that A ∈ O1 and B ∈ O2 . The classes were assigned to ontologies based on their
URI prefixes, and mappings between classes from the same pair of ontologies
were grouped together. Table 4 contains the details of the resulting graphs, and
Table 5 lists for each case the top 10 nodes sorted by the number of edges they
are connected to.
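
The grouping of class mappings into mapping-based links between ontologies can be sketched as follows. The exact prefix rule is an assumption of this example (the ontology is taken to be everything up to the last '#' or '/' of the class URI), as are the function names and the example.org-style URIs.

```python
from collections import Counter, defaultdict


def ontology_of(class_uri):
    """Assumed prefix rule: cut the class URI at its last '#' or '/'."""
    return class_uri[: max(class_uri.rfind("#"), class_uri.rfind("/")) + 1]


def ontology_links(class_mappings):
    """Group class mappings by the unordered pair of ontologies they connect."""
    links = defaultdict(list)
    for a, b in class_mappings:
        o1, o2 = ontology_of(a), ontology_of(b)
        if o1 != o2:  # mappings inside one ontology create no link
            links[frozenset((o1, o2))].append((a, b))
    return links


def ontology_degrees(links):
    """Number of distinct ontologies each ontology is linked to."""
    degree = Counter()
    for pair in links:
        for ontology in pair:
            degree[ontology] += 1
    return degree


# Hypothetical class mappings between three vocabularies.
MAPPINGS = [
    ("http://mo.example/ns#MusicArtist", "http://db.example/onto/MusicalArtist"),
    ("http://mo.example/ns#Record", "http://db.example/onto/Album"),
    ("http://lm.example/movie/film", "http://db.example/onto/Film"),
    ("http://db.example/onto/Film", "http://db.example/onto/Work"),
]
links = ontology_links(MAPPINGS)
degree = ontology_degrees(links)
```

In this toy input the four class mappings collapse into two ontology-level links, both of which involve the vocabulary under http://db.example/onto/.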

                          Table 4. Networks of ontologies

        Property                                     Declared association-based  Co-typing-based
        Number of nodes                              52                          743
        Number of edges                              172                         1352
        Maximum number of connections per node       29                          504
        Node with the maximum number of connections  YAGO                        FOAF
        Average number of connections per node       3.96                        1.85
        Connected components                         5                           35
        Average path length                          2.92                        2.48

Fig. 1. The network of ontologies derived from instance coreference links. Ontologies
with wide coverage used by popular repositories serve as “hubs”: YAGO, DBPedia,
OpenCYC, Freebase, and UMBEL.

   The graphs constructed using declared association-based and co-typing-based
evidence are shown in Fig. 1 and Fig. 2. In the declared association-based graph
(Fig. 1), the main factor which influences the position of an ontology in the
graph is topic coverage. The top 5 “hub” ontologies with wide coverage do not
have a large difference in the number of connections: YAGO (29), Freebase (28),
UMBEL (27), OpenCYC (26), and DBPedia (23). The 6th and the 7th ranking
nodes (LinkedMDB and eurostat), which cover specific domains, have only 13
connections each. It is interesting to note that although Freebase is connected
to fewer repositories than DBPedia in the LOD cloud5 , this does not have an
impact at the schema level. This is the effect of indirect owl:sameAs mappings
inferred by transitivity. Connections of domain-specific ontologies (such as Mu-
sic ontology or Geonames) point to other ontologies covering the same domain,
and, indirectly, to the underlying repositories which contain relevant data. This
makes them good starting points when the task is to find several datasets rel-
evant to a specific topic. Both networks contain several disjoint subgraphs (5
and 35 respectively), and in both cases the same pattern occurs: there exists one
large central cluster including the majority of nodes and several small ones usu-
ally including a pair of ontologies (e.g., a cluster {http://purl.uniprot.org/core/,
http://bio2rdf.org/ns/uniprot#}). In Fig. 1, similarly to the data-level LOD
cloud, we can also observe the existence of two “communities” centered around
DBPedia and RKBExplorer. At the schema level these are centered around
YAGO and AKT ontologies. Both communities are connected via the FOAF
ontology (rdfs:subClassOf relations with the foaf:Person class). At the data level,
RKBExplorer and DBPedia are connected via two other repositories: DBLP
Hannover and DBLP Berlin. The reason for missing schema-level links between
AKT and the ontologies used in DBPedia was the omission of intermediate
owl:sameAs links on this route, which did not allow indirect declared
association-based class mappings to be produced.

                  Table 5. Top 10 ontologies (by number of edges)

             Rank  Declared association-based   Co-typing-based
                   Name                  Edges  Name                Edges
             1     YAGO                  29     FOAF                504
             2     Freebase              28     Wordnet             296
             3     UMBEL                 26     AKT                 66
             4     OpenCYC               25     Music ontology      52
             5     DBPedia               23     semantic-mediawiki  37
             6     eurostat (VU Berlin)  13     RSS                 30
             7     LinkedMDB             13     eurostat            30
             8     Geonames              12     DAML-OIL            29
             9     openlinksw-demo       12     geneontology        26
             10    FOAF                  11     Mindswap            25

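
The pattern of one large central cluster plus several small disjoint ones can be detected with a standard breadth-first search over the ontology graph. The sketch below is a generic illustration: the function name is ours and the edge list is a toy example in which only the UniProt pair is taken from the clusters mentioned in the text.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components of an undirected graph, largest first."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in list(adjacency):
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:  # breadth-first traversal from the start node
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy ontology graph: one central cluster and one isolated pair.
toy_links = [
    ("YAGO", "DBPedia"),
    ("YAGO", "Freebase"),
    ("DBPedia", "UMBEL"),
    ("http://purl.uniprot.org/core/", "http://bio2rdf.org/ns/uniprot#"),
]
components = connected_components(toy_links)
```

The first component returned is the central cluster; the remaining ones are the small satellite clusters.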
    The co-typing-based network (Fig. 2) is substantially larger (746 nodes vs
53) and mainly connects ontologies used outside the LOD cloud (including even
legacy schemas like DAML-OIL). In this graph, the distribution of nodes
primarily illustrates ontology popularity: FOAF (504 connections) and Wordnet (296)
get the most connections because they are reused in many datasets.

Fig. 2. The network of ontologies derived from ontology reuse (only ontologies with at
least 10 populated classes are shown). FOAF and Wordnet, reused by many datasets,
have the most connections.

5 http://richard.cyganiak.de/2007/10/lod/


4     Related Work

Originally, schema matching approaches in the database and Semantic Web do-
mains primarily focused on the task of matching two input schemas in isolation
from others [4], [5]. With the availability of public ontologies, schema matching
methods started to utilise external sources as background knowledge. One ap-
proach proposed in [6] matches two ontologies by linking them to an external
third one. Then, semantic relations defined in this external ontology are used
to infer mappings between entities of two original ones. The SCARLET tool [7]
further elaborates this approach and employs a set of external ontologies, which
it searches and selects using the Watson ontology search server6 .
    Recently, with the growing number of public repositories storing data about
overlapping domains, it became important to analyse the emerging network
of interconnections as a whole. The idMesh system [8] analysed the network of
instance-level owl:sameAs coreference links between semantic repositories with
the aim to identify spurious links and remove them. In [3] the authors used
light-weight matching techniques to create a large set of schema-level mappings
between ontologies from the BioPortal repository describing the medical do-
main. Then, the authors analysed the resulting network to gain insights about
ontological coverage of the domain. We take a similar approach; however, our
primary interest is in schema mappings which emerge from existing data-level
links between repositories.


5     Conclusion and future work

As mentioned in section 1, schema-level mappings can become a valuable asset
for the data publisher who wants to integrate a new repository into the Linked
Data environment: for example, having a new repository about music described
using the Music ontology, the pool of potential data sources to connect to would
include other datasets using the same ontology, but also repositories which
use ontologies mapped to it (DBPedia, Freebase, LinkedMDB, etc.). From
this pool the publisher can select the most comprehensive data source for her
needs.
    We consider the work described in this paper as our starting point in studying
the emerging relations between ontologies on the Web of Data. There are several
interesting future directions of research. First, our approach focused on estab-
lishing mappings between classes while ignoring mappings between properties,
which are equally important in data integration scenarios. Mappings between
properties are needed to represent data from different ontologies in a uniform
way, which is necessary for applying coreference resolution tools or, in a more
general scenario, for presenting query results to the user.

6 http://watson.kmi.open.ac.uk/WatsonWUI/

   Second, in the context of our intended scenario (assisting the publisher in the
choice of appropriate points of linkage) the quality of mappings had relatively
low importance: a mapping is still useful if it connects two classes with a strong
degree of overlap, but no strict logical relation holds. This allowed us to use very
simple matching techniques to generate schema-level mappings. However, this
assumption does not hold for many actual data integration scenarios: in general,
a precise SPARQL query is not expected to return irrelevant results. Thus, ap-
plying state-of-the-art ontology matching tools to discover high-quality schema
mappings in the Linked Data environment constitutes the second direction for
future work.


6    Acknowledgements
Part of this research has been funded under the EC 7th Framework Programme,
in the context of the SmartProducts project (231204). The authors would like to
thank Paul Groth and Cristophe Gueret for providing the 4store Amazon EC2
community server hosting the BTC dataset.


References
1. Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. International
   Journal on Semantic Web and Information Systems (IJSWIS) 5(3) (2009) 1–22
2. Nikolov, A., Uren, V., Motta, E., de Roeck, A.: Overcoming schema heterogeneity
   between linked semantic repositories to improve coreference resolution. In: 4th Asian
   Semantic Web Conference (ASWC 2009), Shanghai, China (2009) 332–346
3. Ghazvinian, A., Noy, N.F., Jonquet, C., Shah, N., Musen, M.A.: What four million
   mappings can tell you about two hundred ontologies. In: 8th International Semantic
   Web Conference (ISWC 2009), Washington DC, USA (2009) 229–242
4. Rahm, E., Do, H.H.: Data cleaning: Problems and current approaches. IEEE
   Bulletin of the Technical Committee on Data Engineering 23(4) (2000)
5. Euzenat, J., Shvaiko, P.: Ontology matching. Springer-Verlag, Heidelberg (2007)
6. Aleksovski, Z., Klein, M.C.A., ten Kate, W., van Harmelen, F.: Matching unstruc-
   tured vocabularies using a background ontology. In: 15th International Conference
   on Knowledge Engineering and Knowledge Management (EKAW 2006). (2006) 182–
   197
7. Sabou, M., d’Aquin, M., Motta, E.: Exploring the Semantic Web as background
   knowledge for ontology matching. Journal of Data Semantics XI (2008) 156–190
8. Cudré-Mauroux, P., Haghani, P., Jost, M., Aberer, K., de Meer, H.: idMesh: Graph-
   based disambiguation of linked data. In: 18th International World Wide Web Con-
   ference (WWW 2009), Madrid, Spain, ACM (2009) 591–600