                              Holistic and Scalable
                     Ontology Alignment for Linked Open Data

                  Toni Gruetze                        Christoph Böhm                          Felix Naumann
             Hasso Plattner Institute               Hasso Plattner Institute               Hasso Plattner Institute
              Potsdam, Germany                       Potsdam, Germany                       Potsdam, Germany
           toni.gruetze@hpi.uni-potsdam.de      christoph.boehm@hpi.uni-potsdam.de      felix.naumann@hpi.uni-potsdam.de


Copyright is held by the author/owner(s).
LDOW2012, April 16, 2012, Lyon, France

ABSTRACT

The Linked Open Data community continuously releases massive amounts of RDF data that shall be used to easily create applications that incorporate data from different sources. Inter-operability across different sources requires links at instance- and at schema-level, thus connecting entities on the one hand and relating concepts on the other hand. State-of-the-art entity- and ontology-alignment methods produce high-quality alignments for two "nicely structured" individual sources, where an identification of relevant and meaningful pairs of ontologies is a precondition. Thus, these methods cannot deal with heterogeneous data from many sources simultaneously, e.g., data from a linked open data web crawl.

To this end we propose Holistic Concept Matching (HCM). HCM aligns thousands of concepts from hundreds of ontologies (from many sources) simultaneously, while maintaining scalability and leveraging the global view on the entire data cloud. We evaluated our approach against the OAEI ontology alignment benchmark as well as on the 2011 Billion Triple Challenge data and present high-precision results created in a scalable manner.

1. INTRODUCTION

In 2006 Berners-Lee proposed the Linked Open Data (LOD) design principles1. These principles outline the vision of a global data cloud that can be used to build novel applications incorporating information about real-world entities from many sources. He suggests dereferenceable HTTP URIs for naming things that return useful information when looked up on the web. Further, he encourages links to other URIs so that one can discover more information about things under consideration. The semantic web community adopted these suggestions and we are now witnessing the growth of a giant global data cloud comprising information from many sources using de-facto standards such as RDF, RDFS, OWL, etc. The pure availability of this data following a set of principles is a big win since, in general, the use of (non-linked) open data requires source-specific approaches to access, query, and filter the data of interest. Instead, when published as LOD, one can leverage different sources through common mechanisms such as HTTP and SPARQL. Examples of such LOD applications are showcased on data.gov and data.gov.uk.
1 http://www.w3.org/DesignIssues/LinkedData.html

However, the LOD vision includes the connection of data from different sources via links to facilitate an easy integration. As of September 2011, the LOD cloud comprised 295 sources, which all fulfill the basic LOD principles2. In contrast, the number and quality of links across these sources are an ongoing issue since their discovery is a major challenge [1]. In this paper, we focus on schema-level links, i.e., ontology alignments across data sources; of the 295 data sources, 190 use proprietary vocabulary terms. Of these 190 sources, only 15 offer mappings to other widely deployed vocabularies; but 159 provide dereferenceable URIs for proprietary terms, i.e., descriptive information for these "new terms" is available. Thus, the evolution of the web of vocabulary terms requires running ontology alignment approaches on (all) pairs of sources without any (or with only few) mappings to other vocabularies.
2 http://www4.wiwiss.fu-berlin.de/lodcloud/state/

State-of-the-art ontology matching has been designed to cope with nicely structured and well-defined ontologies in order to produce high-quality mappings for one pair of sources from one specific domain at a time. Instead, in the case of data that stems from the web, we need approaches that can (simultaneously) deal with heterogeneous and incomplete vocabulary definitions from many different sources dealing with various topics. Further, the vocabularies from the LOD cloud as a whole allow a holistic view on the web of vocabulary terms and thus the creation of alignments that depend on other alignments and dependencies. Resulting alignment information across many sources can be used for web query answering or the discovery of sources with respect to specific topics. However, it is a major scalability challenge to deal with very many vocabulary terms gathered from the linked data web. Therefore, we tackle one of the major challenges in ontology matching, namely matching at a very large scale [19]. We further add the requirement to be applicable to real-world web data from various origins instead of two specific sources.

In the following section we briefly review state-of-the-art techniques for aligning ontologies and then derive requirements for an approach that deals with heterogeneous vocabulary definitions from the web of data (Sec. 1.1). Next, we outline our general approach for aligning multiple LOD vocabularies (Sec. 1.2), followed by a technical description of the respective alignment phases (Sec. 2–4). Along with the technical details of our approach we present measurements to convince the reader of the respective phase's scalability.
1.1 State-of-the-art

The matching of data models is an essential task for various areas in information science, e.g., information integration, peer-to-peer data management, Web service composition, etc. [8]. Thus, a wide range of approaches has been published in the area of schema and ontology matching. Current ontology matching approaches often enter the annual Ontology Alignment Evaluation Initiative (OAEI) [7]. The OAEI aims at comparing the performance of different systems by assessing respective strengths and weaknesses. Successful participants are, for instance, [2–4, 12, 15, 17]. Many state-of-the-art approaches are based on the combination of different basic matching techniques and require a parameter optimization for each matching task. This configuration is addressed by different meta matching systems [5, 16, 22]. However, most of the state-of-the-art approaches have not been run on large and heterogeneous ontologies that stem from the LOD cloud.

Nevertheless, the number of ontology matching approaches dedicated to ontologies from the LOD domain has grown recently. These LOD approaches commonly use only one (or few) basic matching techniques. Often, these techniques utilize special RDF and LOD characteristics to improve the matching result. For instance, Nikolov et al. introduce an approach that utilizes owl:sameAs links between data sets in the LOD cloud to derive concept overlaps [18].

However, there are hundreds of ontologies in the Web of Data, which have been created for different use cases but still contain many overlaps worth discovering. Therefore, the ability to perform cross-domain matching is especially important. Jain et al. introduced an approach that utilizes Wikipedia categories for bootstrapping. This way, they can cover knowledge from various domains [13, 14]. Furthermore, the LOD ontologies are commonly very large and contain many instances. Suchanek et al. use instance knowledge to match instance- and schema-level entities [25].

Recently, Rahm surveyed different approaches to large-scale matching tasks [20]. Besides different basic techniques to reduce runtime and manual configuration effort, he explicitly emphasized holistic schema matching. Approaches of this type process a set of input schemata in one run (like [23]) and infer knowledge from the entirety of schemata [9, 10, 24]. In this paper we adopt this notion for ontology alignment on a large scale.

The input for web-scale ontology alignment is a set C of concepts that has been gathered from the LOD cloud via web access methods. The concepts in C stem from many different ontologies. For each concept in C, there is at least a URI – its id. Preferably, there is further information, such as a label, a description, or a comment. Additionally, there can be further metadata like structural information. For the processing of these large amounts of heterogeneous ontological information we suggest the following properties for an alignment approach:

• The approach must work with ontologies from diverse domains, i.e., the underlying concept matching strategy must be applicable to many domains.

• Due to a large input concept set, the approach must perform automatic alignment in sub-quadratic runtime (in the number of concepts).

• The approach should process concept definitions only. That is, it can neglect instance data due to its immense size. Further, property definitions can be ignored since we found that property definitions and their actual usage differ largely.

• Ontology structures should not be used for the actual alignment creation (if available at all). This is for scalability reasons and since the structure across diverse ontologies varies largely and is thus not beneficial. However, the approach can exploit structural information, such as subclass relationships, to verify derived alignments and find semantic contradictions.

• The approach must return equivalence relationships for vocabulary terms. These relationships can be fuzzy (depending on some parameter) since different ontologies have different granularities, e.g., umbel:SpacePlatform_Manned ∼ dbpedia:SpaceStation.

Please note that others might make other design decisions; however, due to the immense scale and the questionable quality of current general web data we follow these properties.

1.2 Alignment approach overview

The aforementioned properties shall hold for an approach that can be applied to web-scale input data. However, at the same time we target a holistic view on the data to leverage the information at hand as a whole. Therefore, we propose the Holistic Concept Matching approach (HCM). To address the scalability challenge we group the input data by topic and thus create small groups of concepts that can be aligned locally. Since we group by topics, we can still infer relationships holistically, i.e., draw conclusions not only based on a pairwise similarity but additionally based on alignments and dependencies among other topically related members of the same group.

Consider for instance five sources (A, B, C, D, E) and let {a1, . . . , am, b1, . . . , bn, c1, . . . , co, d1, . . . , dp, e1, . . . , eq} be the concepts from these sources. Running a traditional approach on all available pairs of ontologies would result in up to (5 choose 2) = 10 isolated runs, which does not scale, given that alignment approaches are often based on complex lexical and structural properties [2, 15, 17]. Also, isolated runs may yield low-quality results, since vocabulary definitions from the web can be incomplete and highly heterogeneous in terms of granularity and structure of the ontology.

With the help of additional topic knowledge for the sources and their entities we are able to group them. For instance, given that the three sources (A, B, C) store media-related entities and (D, E) provide information from the life sciences, we examine groups of concepts, e.g., {a1, . . . , am, b1, . . . , bn, c1, . . . , co} and {d1, . . . , dp, e1, . . . , eq} – yielding fewer runs of the approach. Note that these groups need not directly relate to the input ontologies: For instance, if d1 = human, then it could also reside in the first group. However, within these groups, we can then run computationally more complex approaches to take a holistic view based on multiple ontologies.

Specifically, our general approach is shown in Fig. 1. Given the data from the LOD cloud, we extract a knowledge representation for all available concepts. This representation can be a simple descriptive string, a feature vector, or a more complex data structure. Next, we apply topical grouping in order to create smaller sets of concepts.
Within these sets, we then create alignments by identifying similar knowledge representations and reasoning among them using additional structural information from the input as well as further candidates from the same group.

Figure 1: Scalable and holistic LOD ontology alignment. (Pipeline: from the LOD cloud, knowledge extraction yields concept knowledge representations, grouping yields candidate groups, and alignment generation yields consistent alignments.)

Obviously, many techniques can be plugged into this approach. Depending on the knowledge representation, one can choose the specific representation of a concept's topic, specific topic similarities as well as specific concept similarities in order to find alignments. In the remainder of this paper, we report on the adoption of the Wikipedia category forest [13] for HCM (Sec. 2). The topical grouping is done using a set similarity index (Sec. 3). For the alignment generation we combine the Wikipedia category forest similarity [14] and a rule-based verification approach [15] (Sec. 4).

2. KNOWLEDGE REPRESENTATION

The comparison of different concepts requires an abstract representation of their semantic content, i.e., a knowledge representation. First we present our method to compute this representation, then we show our performance results.

2.1 Wikipedia Category Forests

Given a set C of concepts from many different ontologies gathered from the LOD cloud, we now elaborate on the data structure we use to represent a concept. To this end, we chose Wikipedia Category Forests (WCFs) as proposed for the BLOOMS algorithm in [13, 14], as BLOOMS is a state-of-the-art alignment algorithm for matching heterogeneous ontologies from many domains. A WCF is a set of Wikipedia category trees created as follows:

1. Given a single concept c ∈ C, create a list of keywords kw(c) = {k1, . . . , kn} from c.

2. Given kw(c) = {k1, . . . , kn}, a Wikipedia search for all keywords returns a ranked list of Wikipedia pages R = {p1, . . . , pm}. These pages are the roots of the trees in the resulting forest.

3. For each page p among the d top-ranked pages {p1, . . . , pd} in R and a height parameter h, construct a tree in the following recursive manner:

   (a) In recursion level 1 (≤ h), determine p's categories {α1, . . . , αq}.
   (b) In recursion level 2 (≤ h), determine the α's super-categories {β1, . . . , βr}.
   (c) In recursion level 3 (≤ h), determine the β's super-categories {γ1, . . . , γs}.
   (d) etc.

We now explain the three steps in more detail.

Step 1. To determine keywords kw(c) for a concept c we have tested several methods. In the following, we use two orthogonal base methods, namely the TFIDFn-extractor and the ConceptID-extractor. The TFIDFn-extractor processes descriptions of concepts, i.e., comments and labels. For this, we merge the description texts, tokenize them, and neglect stop words3. Then, we determine the tf-idf score for each token with respect to the overall corpus of concept descriptions. Finally, we select the top-n tf-idf-ranked tokens as keywords kw(c).
3 using the English stop word lists of Apache Lucene (v2.3.0) and of the MediaWiki plugin LuceneSearch (v2.1.3), with 33 and 121 words, respectively

The ConceptID-extractor is solely based on the concept's URI. The concept id is the unique identifier of a concept in its ontology definition. HCM uses a straightforward approach to determine the concept id: We truncate the prefix of the URI up to the last occurrence of either #, :, or /. The concept id must not contain any character other than letters or underscores. To extract tokens, we split the concept id at camel-case characters and underscores. After again removing stop words, this results in the set of keywords kw(c). For instance, the concept id for http://umbel.org/umbel/rc/SpacePlatform_Manned would be SpacePlatform_Manned. From this we create kw(c) = {space, platform, manned}.

Additionally, HCM supports the ConceptID_TFIDFn-extractor that combines both methods. First, ConceptID keywords are extracted. If no WCF can be constructed (see next steps), HCM tries to build a forest using TFIDFn keywords instead. Thus, the ConceptID_TFIDFn-extractor maximizes the number of concepts that can be represented by a WCF.
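For illustration, the following is a minimal Java sketch of the ConceptID keyword extraction described above. The class name and the tiny stop-word list are placeholders and not part of the HCM code base; HCM uses the Lucene and LuceneSearch stop-word lists (footnote 3).

    import java.util.*;

    // Illustrative sketch of the ConceptID keyword extraction; names are placeholders.
    public class ConceptIdKeywords {

        // stand-in for the Lucene/LuceneSearch stop-word lists
        private static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("the", "of", "and", "a", "an", "in"));

        public static Set<String> keywords(String uri) {
            // concept id = suffix after the last occurrence of '#', ':' or '/'
            int cut = Math.max(uri.lastIndexOf('#'),
                      Math.max(uri.lastIndexOf(':'), uri.lastIndexOf('/')));
            String conceptId = uri.substring(cut + 1);
            // reject ids containing characters other than letters and underscores
            if (!conceptId.matches("[A-Za-z_]+")) {
                return Collections.emptySet();
            }
            Set<String> kw = new LinkedHashSet<>();
            for (String part : conceptId.split("_")) {                  // split at underscores
                for (String token : part.split("(?<=[a-z])(?=[A-Z])")) { // and at camel case
                    String t = token.toLowerCase();
                    if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                        kw.add(t);
                    }
                }
            }
            return kw;
        }

        public static void main(String[] args) {
            // prints [space, platform, manned]
            System.out.println(keywords("http://umbel.org/umbel/rc/SpacePlatform_Manned"));
        }
    }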
Step 2. Next, HCM performs a Wikipedia full-text search using a search query that is built by concatenating all keywords kw(c) (delimited by spaces). From the resulting Wikipedia pages we choose the top d pages as tree roots for the creation of a WCF. The parameter d – the forest depth – influences a WCF's coverage of conceptual meanings. To catch all potential meanings of a concept, several Wikipedia articles and their category hierarchies have to be considered, i.e., it requires a deep forest. On the other hand, a deeper forest has a higher risk of containing trees not related to the actual concept. Therefore, the selection of the forest depth parameter is a trade-off between semantic coverage and irrelevance.

Step 3. The tree construction builds trees recursively from the root to the maximal tree height h. The height mainly influences the level of abstractness of a WCF. The higher the trees, the more category hierarchy layers are considered. This leads to more abstract categories in the WCF. Instead, the lower the trees, the less probable is a topical overlap among related concepts. We will discuss different parameter configurations in the experiments section.
For instance, given the resource umbel:MannedSpacecraft, the ConceptID-extractor yields {manned, spacecraft}, which results in the following Wikipedia search result (d = 3): {Human spaceflight, Spacecraft, Orion (spacecraft)}. Figure 2 depicts the tree for the root article Spacecraft (h = 2). As one can see, the higher the layer, the more nodes (Wikipedia categories) exist and the more abstract these categories are. Furthermore, we note that different tree nodes occur multiple times, e.g., "spaceflight" or "aerospace engineering". These nodes play an essential role in the next phase of HCM (Sec. 3).

Figure 2: Tree for the Wikipedia article "Spacecraft" with h = 2. (Root node: the article "Spacecraft"; 1st-layer categories: "astronautics", "spacecraft", "pressure vessels"; 2nd-layer categories include "aerospace engin.", "spaceflight", "vehicles by media", "space tech.", "pneumatics", "struct. engin.", "gas tech.", "hydraulics", and "pressure containers".)
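The following Java sketch outlines Steps 2 and 3 under the assumption of a local Wikipedia index; the WikipediaIndex interface and all other names are placeholders and not HCM's actual classes.

    import java.util.*;

    // Sketch of the WCF construction (Steps 2 and 3); all names are illustrative.
    public class WcfBuilder {

        /** Abstraction of a local Wikipedia full-text index and category hierarchy. */
        public interface WikipediaIndex {
            List<String> search(String query, int topD);        // top-d article titles
            Set<String> categoriesOf(String pageOrCategory);    // direct (super-)categories
        }

        /** One category tree, represented as node -> depth with the root article at depth 1. */
        public static Map<String, Integer> buildTree(WikipediaIndex wiki, String rootPage, int h) {
            Map<String, Integer> depth = new LinkedHashMap<>();
            depth.put(rootPage, 1);
            List<String> frontier = Collections.singletonList(rootPage);
            for (int level = 1; level <= h; level++) {          // add h layers of categories
                List<String> next = new ArrayList<>();
                for (String node : frontier) {
                    for (String cat : wiki.categoriesOf(node)) {
                        if (!depth.containsKey(cat)) {          // keep the smallest depth per node
                            depth.put(cat, level + 1);
                            next.add(cat);
                        }
                    }
                }
                frontier = next;
            }
            return depth;
        }

        /** A WCF: the set of trees rooted at the top-d search results for the keywords. */
        public static List<Map<String, Integer>> buildForest(WikipediaIndex wiki,
                                                             Set<String> keywords, int d, int h) {
            String query = String.join(" ", keywords);          // keywords delimited by spaces
            List<Map<String, Integer>> forest = new ArrayList<>();
            for (String root : wiki.search(query, d)) {
                forest.add(buildTree(wiki, root, h));
            }
            return forest;
        }
    }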

2.2 Experiments

Setup. All experiments were performed on a Windows 2008 R2 Enterprise Server with two quad-core Intel Xeon processors (2.66 GHz) and 30 GB of memory. To evaluate the performance of our approach in a web-scale scenario, we used the 2011 Billion Triple Challenge (BTC) data4. The BTC data is a crawl from the LOD cloud and consists of approximately 2.2 billion triples. We have implemented all described methods in Java.
4 http://km.aifb.kit.edu/projects/btc-2011

The extraction of concept information can be done in linear time with Hdrs5 – a scalable distributed RDF store that allows fast in-order index scans. HCM selects concepts by considering resources occurring in triples with the following predicates: rdf:type, rdfs:subClassOf, rdfs:domain, rdfs:range, owl:equivalentClass, owl:disjointWith, rdfs:label, and rdfs:comment. In this manner HCM identified approx. 1M concepts in the BTC.
5 http://code.google.com/p/hdrs
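As an illustration, the following sketch shows such a selection pass over the triples. The Triple type stands in for the HDRS index scan, and the mapping from predicate to subject/object position is one plausible reading — the paper does not spell out the exact rule.

    import java.util.*;

    // Illustrative concept selection over schema-level predicates; names are placeholders.
    public class ConceptSelector {

        private static final Set<String> CLASS_OBJECT = new HashSet<>(Arrays.asList(
                "rdf:type", "rdfs:domain", "rdfs:range"));
        private static final Set<String> CLASS_BOTH = new HashSet<>(Arrays.asList(
                "rdfs:subClassOf", "owl:equivalentClass", "owl:disjointWith"));
        private static final Set<String> CLASS_SUBJECT = new HashSet<>(Arrays.asList(
                "rdfs:label", "rdfs:comment"));

        public static final class Triple {
            final String s, p, o;
            public Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        }

        /** one linear scan over the triples, collecting concept URIs */
        public static Set<String> concepts(Iterable<Triple> triples) {
            Set<String> concepts = new HashSet<>();
            for (Triple t : triples) {
                if (CLASS_OBJECT.contains(t.p)) concepts.add(t.o);          // object names a class
                else if (CLASS_BOTH.contains(t.p)) { concepts.add(t.s); concepts.add(t.o); }
                else if (CLASS_SUBJECT.contains(t.p)) concepts.add(t.s);     // label/comment subject
            }
            return concepts;
        }
    }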
In contrast to the original BLOOMS algorithm [13], we cannot use the Wikipedia search service. This is because we need to issue very many search requests since we deal with thousands of concepts. This vast amount of queries would lead to a high network overhead when querying the search service. Therefore, for HCM, we load a Wikipedia dump6 into a full-text index (articles and category hierarchy). The index was created with Apache Lucene7 (v2.3.0) and a modified version of the LuceneSearch8 MediaWiki plugin implementation (v2.1.3). Additionally, we used MongoDB9 (v1.6.5) running on the same machine as the HCM implementation. HCM uses MongoDB to store intermediate results between the three phases (Sec. 2, 3, and 4).
6 We used the English Wikipedia dump of August 3rd, 2011 with approx. 11 million pages: http://dumps.wikimedia.org/enwiki/20110803/
7 http://lucene.apache.org/java/docs/index.html
8 http://www.mediawiki.org/wiki/Extension:Lucene-search
9 http://www.mongodb.org/

Measurements. Given the BTC data, where we could extract 1M concepts, the overall keyword extraction time is 7 minutes (Step 1). The average number of keywords per set, i.e., per concept, is 2.77.

As expected, the Wikipedia index query time (Step 2) is linear in the number of extracted keyword sets. For instance, given 293k keyword sets from the ConceptID_TFIDF3-extractor, querying took 64 minutes.

The input to the keyword set extractors comprises 1M concepts. However, the ConceptID-extractor cannot yield a result for all input concepts, since we do not deal with blank nodes10 and malformed URIs. Further, not all keyword sets created yield a query result; in the end 238k keyword sets could be used to build WCFs. Nevertheless, when additionally applying the TFIDF3-extractor we can create another 55k WCFs.
10 Approx. 50% of all input concept IDs are blank nodes.

The forest construction runtime mainly depends on the forest depth d and the tree height h. Table 1 shows runtimes for the forest construction algorithm with different parameters. Apparently, the runtime depends linearly on the forest depth d. An increase of the tree height h leads to an exponential runtime increase.

    h\d      5        10       15       20
    1      0:06h    0:09h    0:11h    0:14h
    2      0:15h    0:23h    0:28h    0:38h
    3      0:31h    0:52h    1:18h    1:34h
    4      1:13h    3:03h    3:55h    5:45h
    5      4:14h    8:28h   14:01h   15:40h

Table 1: Forest construction runtime (in hours) for different tree heights h and forest depths d, using the ConceptID_TFIDF10-extractor.

Similar to Jain et al., we set the tree height h = 4 and the forest depth d = 10 for all following experiments [13]. For this configuration, the WCFs consist of 9.53 trees on average.

3. CANDIDATE GROUP CREATION

The previous step extracts WCFs, i.e., our knowledge representation of choice. Given these knowledge representations, we now need to find topically related concepts in order to group them. Note that in the following we use the terms domain and topic interchangeably. A domain or topic is a field of the real world where different concepts together play a role. The underlying intuition for the following is that a search for alignments among concepts within (and not across) topics should not cause a major loss of recall. Again, we first present our method, followed by an evaluation.

3.1 Topical groups

In our approach, the grouping (by topic) shall reduce the runtime complexity of the alignment generation step by splitting the problem space into smaller independent tasks (Sec. 4). This procedure is often referred to as blocking or partitioning.

Next, we discuss the details of finding topically related WCFs from the previous step. Given a set F of WCFs, we need to identify disjoint groups G1, . . . , Gn ⊂ F of topically related forests (topic groups).
In HCM we implemented the following procedure:

1. For each WCF f ∈ F, extract topic(f) = {t1, . . . , tm} from f. We describe the topic of a WCF using a set of tree nodes extracted from the WCF.

2. Given a topic topic(f1), identify all forests f2 ∈ F with a high topical overlap (topic(f1) ≈ topic(f2)).

3. Given a set of pairs (f1, f2) with a high topical overlap, determine groups G of forests by associating topically overlapping forests transitively.

Step 1. The topic topic(f) of a WCF f is a subset of its tree nodes ({p, α1, . . . , αq, β1, . . . , βr, . . . }). In the process of finding characteristic nodes that represent this concept's topic, we ignore very specific categories like "Soviet manned space program" and very generic categories like "Humans" or "Wikipedia article lists". In this paper we report on a simple yet promising approach we have tested (among others) for the topic extraction: The TFIDFForestn-extractor again utilizes the tf-idf measure. Nodes common in a WCF are scored with a high term-frequency value (tf), whereas nodes popular in many WCFs are scored with a low inverse-document-frequency value (idf). We rank a WCF's tree nodes in descending tf-idf order and select the top n for representing the topic. The choice of n is crucial for the quality of the tree node set representing the concept's topic. Experiments indicated n = 10 to be a reasonable choice. Higher values lead to very specific topics, whereas smaller values of n lead to a low representativeness.
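A minimal Java sketch of such a tf-idf-based topic extraction follows. It treats a WCF as a list of node sets (one per tree) and is illustrative only — the exact scoring and tie-breaking of the TFIDFForestn-extractor are not spelled out here.

    import java.util.*;
    import java.util.stream.Collectors;

    // Illustrative tf-idf-based forest topic extraction; names are placeholders.
    public class ForestTopics {

        /** document frequency: in how many forests each tree node occurs */
        public static Map<String, Integer> documentFrequencies(List<List<Set<String>>> forests) {
            Map<String, Integer> df = new HashMap<>();
            for (List<Set<String>> forest : forests) {
                Set<String> nodes = new HashSet<>();
                for (Set<String> tree : forest) nodes.addAll(tree);
                for (String n : nodes) df.merge(n, 1, Integer::sum);
            }
            return df;
        }

        /** topic(f): the top-n tree nodes of one forest, ranked by tf-idf */
        public static Set<String> topic(List<Set<String>> forest, Map<String, Integer> df,
                                        int numForests, int n) {
            Map<String, Integer> tf = new HashMap<>();           // node frequency within the WCF
            for (Set<String> tree : forest)
                for (String node : tree) tf.merge(node, 1, Integer::sum);
            return tf.entrySet().stream()
                    .sorted((a, b) -> Double.compare(tfIdf(b, df, numForests),
                                                     tfIdf(a, df, numForests)))
                    .limit(n)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toCollection(LinkedHashSet::new));
        }

        private static double tfIdf(Map.Entry<String, Integer> e,
                                    Map<String, Integer> df, int numForests) {
            return e.getValue() * Math.log((double) numForests / df.getOrDefault(e.getKey(), 1));
        }
    }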
Step 2. Next, we compare the topic sets to identify related forests. We use the Jaccard coefficient J and a corresponding threshold θJ to determine the similarity of topic sets (J(topic(f1), topic(f2)) ≥ θJ). The naïve pairwise comparison of all forest topic sets leads to a quadratic runtime behavior with respect to the number of forests |F|. To reduce this effort we tested different set-based similarity self-join techniques and selected ppjoin [27], which delivered the most promising results (see Sec. 3.2). The ppjoin is a set similarity self-join technique that reduces the number of object pairs to be compared by using an inverted index and two filtering techniques on the candidate sets (prefix and positional filtering). Our implementation is an adapted version of an open source implementation11. The application of a set-based similarity self-join technique enhances performance and still keeps the topical cohesion of the groups.
11 https://code.google.com/p/similarity-join-tools/

Step 3. To determine groups of topically related WCFs, we build a graph containing the forests F as nodes and edges between topically related forests (J(topic(f1), topic(f2)) ≥ θJ). Next, we use a depth-first search to determine connected components within the graph [11]. Then, all forests in a component are either directly or indirectly connected to each other, but have no connection to forests in other components.
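To make Steps 2 and 3 concrete, the following Java sketch uses the naive quadratic Jaccard join (which HCM replaces by ppjoin, cf. Sec. 3.2) and a depth-first search to read off connected components as topic groups. All names are illustrative.

    import java.util.*;

    // Naive Jaccard self join plus DFS-based connected components; a sketch only.
    public class TopicGrouping {

        static double jaccard(Set<String> a, Set<String> b) {
            if (a.isEmpty() && b.isEmpty()) return 0.0;
            Set<String> inter = new HashSet<>(a);
            inter.retainAll(b);
            return (double) inter.size() / (a.size() + b.size() - inter.size());
        }

        /** groups of forest indices whose topic sets are transitively θJ-similar */
        public static List<List<Integer>> groups(List<Set<String>> topics, double thetaJ) {
            int n = topics.size();
            List<List<Integer>> adj = new ArrayList<>();
            for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
            for (int i = 0; i < n; i++)                          // naive pairwise self join
                for (int j = i + 1; j < n; j++)
                    if (jaccard(topics.get(i), topics.get(j)) >= thetaJ) {
                        adj.get(i).add(j);
                        adj.get(j).add(i);
                    }
            boolean[] seen = new boolean[n];
            List<List<Integer>> groups = new ArrayList<>();
            for (int s = 0; s < n; s++) {                        // DFS per connected component
                if (seen[s]) continue;
                List<Integer> component = new ArrayList<>();
                Deque<Integer> stack = new ArrayDeque<>();
                stack.push(s);
                seen[s] = true;
                while (!stack.isEmpty()) {
                    int v = stack.pop();
                    component.add(v);
                    for (int w : adj.get(v))
                        if (!seen[w]) { seen[w] = true; stack.push(w); }
                }
                if (component.size() > 1) groups.add(component); // ungrouped singletons are dropped
            }
            return groups;
        }
    }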
3.2 Experiments

Figure 3: The forest group distribution for h = 4, d = 10, the ConceptID_TFIDF keyword extractor, and the TFIDFForest10 topic extractor for varying θJ. The green graph shows the number of forests that can be grouped (left axis, in thousands). The shaded areas below indicate the number of forests in groups of a specific size (from = 2 up to > 100k). The dashed red line shows the total number of groups (right axis, in thousands).

Figure 3 depicts the WCF group distribution in the BTC data over the similarity threshold θJ. For one, we illustrate the number of forests that can be grouped for a given θJ (green graph, left axis). Shaded areas below the green graph depict the number of forests in groups of a given size, e.g., size = 2, size ≤ 5, size ≤ 25, etc. Also, we show the total number of groups over θJ (dashed red graph, right axis).

As for the fraction of WCFs that can be grouped, a higher θJ leads to a decrease since the higher θJ, the fewer WCF pairs have a sufficient overlap to be grouped. The shaded areas further show that the higher θJ, the more WCFs fall into small groups, since a stricter overlap criterion induces more individual groups. Vice versa, a lower θJ results in larger groups since smaller groups are merged in this case.

As for the total number of groups for a specific Jaccard threshold, there is a clear increase for higher θJ. This is due to the fact that there are more smaller groups for higher θJ. The figure shows that for θJ = 1.0, 138k (out of 293k) forests have at least one correspondence with an identical topic set. The majority (92k) appears in groups of size two.

For the following experiments we set θJ = 0.7 as default for three reasons: (1) It avoids very large WCF groups, which would raise the runtime of the alignment generation phase. With θJ = 0.7, the maximal group size of 4k leads to a reasonable (and feasible) runtime. (2) θJ = 0.7 provides a significant topical overlap among forests and minimizes haphazard correspondences. (3) This threshold value enables the identification of groups for more than 55% of the available WCFs (162k). Remaining WCFs do not have a sufficient topic overlap and can thus not be considered for the alignment generation.

Figure 4: The forest grouping runtime for different similarity join algorithms. The experiment was executed for θJ = 0.7 using the TFIDFForest10 topic extractor, h = 4, and d = 10. (Runtime in hours over the forest set size, in thousands, for the naive, mpjoin, and ppjoin algorithms.)

In general, the runtime of the candidate group generation depends on the number of input WCFs, the applied self-join technique, and the topic set homogeneity (within the input set). Figure 4 compares the runtime of three different set similarity join techniques with groups for TFIDFForest10 topics, θJ = 0.7, and a varying number of WCFs. The red graph (triangle markers) shows the quadratic runtime of the naïve baseline approach that compares all forests in a pairwise manner.
The yellow line (rhomb markers) indicates the performance of the mpjoin introduced by Ribeiro and Härder [21]. In contrast to the ppjoin, mpjoin tries to minimize the effort for the candidate set determination instead of minimizing the candidate set size. The green line (rectangle markers) illustrates the runtime of the ppjoin algorithm. The ppjoin algorithm performs best and has a nearly linear runtime. Here, the similarity join of 295k forest topics takes only 6 minutes. Remember that these numbers have been created with a very large and heterogeneous sample from the current Web of Data – the BTC. Thus, we can demonstrate a reasonable runtime for current very large web data sets. What is more, Wang et al. introduce a MapReduce implementation of ppjoin [26], which leads to the conclusion that our approach can also be run on even larger future datasets.
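The ppjoin discussed above rests on prefix filtering over a global token order. The following Java sketch shows that filtering step alone (positional filtering and the further ppjoin optimizations are omitted); names are illustrative, and candidate pairs still need exact Jaccard verification.

    import java.util.*;

    // Prefix-filtering candidate generation for a Jaccard self join; a sketch only.
    public class PrefixFilterJoin {

        /** candidate pairs {j, i} (j < i) of topic sets that may satisfy J >= theta */
        public static List<int[]> candidates(List<Set<String>> sets, double theta) {
            Map<String, Integer> freq = new HashMap<>();        // global order: rare tokens first
            for (Set<String> s : sets) for (String t : s) freq.merge(t, 1, Integer::sum);
            Map<String, List<Integer>> index = new HashMap<>(); // inverted index over prefixes
            List<int[]> cand = new ArrayList<>();
            for (int i = 0; i < sets.size(); i++) {
                List<String> tokens = new ArrayList<>(sets.get(i));
                tokens.sort(Comparator.comparingInt(freq::get));
                // prefix length for Jaccard threshold theta: |x| - ceil(theta * |x|) + 1
                int prefix = tokens.size() - (int) Math.ceil(theta * tokens.size()) + 1;
                Set<Integer> seen = new HashSet<>();
                for (int k = 0; k < prefix; k++) {
                    String token = tokens.get(k);
                    for (int j : index.getOrDefault(token, Collections.emptyList())) {
                        if (seen.add(j)) cand.add(new int[]{j, i});
                    }
                    index.computeIfAbsent(token, t -> new ArrayList<>()).add(i);
                }
            }
            return cand;                                        // verify with exact Jaccard afterwards
        }
    }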
4. ALIGNMENT GENERATION

Given the set of candidate groups, the next task is to regard each group and find alignments among its respective members independently. This leads to a runtime reduction due to the reduced input set size (|G| ≪ |F|) and the possible parallelization of the alignment generation task for different groups. In the following, we elaborate on our underlying matching strategy to identify relations among concepts, followed by an experimental evaluation.

4.1 WCF alignments

Given a set G of topically related WCFs, we need to identify a set of alignments A. We propose a procedure that, again, consists of three steps:

1. Extend G by adding related WCFs. These relations originate from the original ontology definitions.

2. Compare all forest pairs f1, f2 ∈ G using a forest overlap score function O. Extract all forest pairs with an overlap value exceeding a given threshold O(f1, f2) ≥ θO; add them to the match candidate set M.

3. Create an alignment graph D by iteratively adding candidate matches from M that neither semantically conflict with ontology definitions, nor with other candidate alignments. Finally, extract alignments A from the conflict-free alignment graph D.

Step 1. First, we extend each topic group G with further related WCFs in order to incorporate immutable axioms from the underlying ontologies. Specifically, related WCFs f2 have an ontology-level relation in common with a WCF f1 ∈ G, i.e., owl:equivalentClass and owl:disjointWith relationships. We did not select other relations since we deem these two as most relevant to reveal semantic conflicts of alignments while keeping the group size low (which is a desirable property). In the following, the topic group G refers to the extended topic group.

Step 2. Next, we compare all pairs of WCFs f1, f2 ∈ G to determine the overlap of the forests. HCM uses the tree overlap similarity of BLOOMS+ to compare individual trees [14]. The tree overlap aggregates common nodes between two trees as follows: Given two arbitrary trees t1, t2 from different WCFs, the tree overlap is defined as:

    Overlap(t1, t2) = log( Σ_{n ∈ (t1 ∩ t2)} (1 + e^(d(n,t1)^(-1) − 1)) ) / log(2|t1|)

Thus, the tree overlap depends on the distance d of common nodes n ∈ (t1 ∩ t2) to the root of t1. The higher the node depth in t1 (d(n, t1)), i.e., the more nodes are between n and the root page p of t1, the smaller the influence of the shared node. The depth of the root page is set to d(p, t1) = 1. For instance, consider the tree t1 with the root p = "Spacecraft" (article) of our previous example (see Fig. 2) and an arbitrary tree t2. Assuming two overlapping nodes x = "astronautics" and y = "aerospace engin." between t1 and t2 such that t1 ∩ t2 = {x, y}, the resulting depths of the nodes in t1 are d(x, t1) = 1 and d(y, t1) = 2. Therefore, the resulting tree overlap of the two trees is:

    Overlap(t1, t2) = log(1 + e^0 + 1 + e^(−1/2)) / log(2 · 14) ≈ 0.385

The tree overlap is not symmetric (Overlap(t1, t2) ≠ Overlap(t2, t1)). With this asymmetry BLOOMS+ is able to identify parent-child relationships between concepts. The equivalence between concepts is derived if Overlap(t1, t2) = Overlap(t2, t1). However, we are interested in a similarity value for two WCFs instead of individual trees. Therefore, we compare all pairs of trees of different WCFs and select the best-matching tree pair. Given two WCFs f1 and f2 (f1, f2 ∈ G; f1 ≠ f2), we define the forest similarity as the maximal harmonic mean of the overlaps over all tree pairs (t1 ∈ f1, t2 ∈ f2):

    O(f1, f2) = max_{t1 ∈ f1, t2 ∈ f2} [ 2 · Overlap(t1, t2) · Overlap(t2, t1) / (Overlap(t1, t2) + Overlap(t2, t1)) ]
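For illustration, the following Java sketch computes the tree overlap and the forest similarity as defined above. Trees are simplified to maps from node names to depths, and the main method replays the "Spacecraft" example (two shared nodes at depths 1 and 2 in a 14-node tree), printing approximately 0.385. Class and method names are ours, not HCM's.

    import java.util.*;

    // Sketch of the BLOOMS+ tree overlap and the forest similarity O(f1, f2).
    public class ForestSimilarity {

        /** Overlap(t1, t2) = log( sum over n in t1∩t2 of (1 + e^(1/d(n,t1) - 1)) ) / log(2|t1|) */
        public static double overlap(Map<String, Integer> t1, Map<String, Integer> t2) {
            double sum = 0.0;
            for (Map.Entry<String, Integer> n : t1.entrySet()) {
                if (t2.containsKey(n.getKey())) {
                    sum += 1.0 + Math.exp(1.0 / n.getValue() - 1.0);
                }
            }
            return sum == 0.0 ? 0.0 : Math.log(sum) / Math.log(2.0 * t1.size());
        }

        /** O(f1, f2): maximal harmonic mean of Overlap(t1,t2) and Overlap(t2,t1) over all tree pairs */
        public static double forestSimilarity(List<Map<String, Integer>> f1,
                                              List<Map<String, Integer>> f2) {
            double best = 0.0;
            for (Map<String, Integer> t1 : f1) {
                for (Map<String, Integer> t2 : f2) {
                    double o12 = overlap(t1, t2), o21 = overlap(t2, t1);
                    if (o12 + o21 > 0.0) {
                        best = Math.max(best, 2.0 * o12 * o21 / (o12 + o21));
                    }
                }
            }
            return best;
        }

        public static void main(String[] args) {
            Map<String, Integer> t1 = new HashMap<>();
            t1.put("astronautics", 1);                           // shared node at depth 1
            t1.put("aerospace engin.", 2);                       // shared node at depth 2
            for (int i = 0; i < 12; i++) t1.put("filler" + i, 2); // pad t1 to 14 nodes
            Map<String, Integer> t2 = new HashMap<>();
            t2.put("astronautics", 1);
            t2.put("aerospace engin.", 1);
            System.out.println(overlap(t1, t2));                 // prints approximately 0.385
        }
    }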
                    2. Compare all forest pairs f1 , f2 ∈ G using a forest over-             Then, we select relevant matches M by using an overlap
                       lap score function O. Extract all forest pairs with an             threshold O(f1 , f2 ) ≥ θO . The selection of a good threshold
                       overlap value exceeding a given threshold O(f1 , f2 ) ≥            is essential for the alignment quality, because a low θO -value
                       θO ; add them to match candidate set M .                           leads to a high amount of irrelevant candidates and thus a
                                                                                          lower precision. The following semantic deduction might
                    3. Create an alignment graph D by iteratively adding                  not be able to eliminate erroneous candidates. In addition,
                       candidate matches M that neither semantically con-                 the run-time complexity in the semantic verification phase
                       flict with ontology definitions, nor with other candi-             grows. A high θO -value instead leads to fewer candidates
                                                                                                  f3




                                                                                                                         x                      x
and thus a low recall. In Sec. 4.2 we evaluate the different                                       equivalence           disjointness                   parentship
thresholds and their influence on the alignment quality.
                                                                              ex                        ex                       ex                                ex                       ex
                                                                        f1              f2     f1 x           x f2        f1               f2                f1              f2       f1              f2
                                                                                                                x          x               x
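To make the forest similarity and the threshold-based candidate selection concrete, the following Python sketch computes O(f1, f2) under the assumption that the BLOOMS+ tree overlap is available as a function overlap(t1, t2); the function and type names are illustrative and not taken from the paper.

    from itertools import product
    from typing import Callable, Sequence

    # A WCF (Wikipedia Category Forest) is modelled here as a sequence of trees;
    # the concrete tree type and the BLOOMS+ tree overlap are assumed to be given.
    Tree = object
    Forest = Sequence[Tree]

    def forest_similarity(f1: Forest, f2: Forest,
                          overlap: Callable[[Tree, Tree], float]) -> float:
        """Maximal harmonic mean of the asymmetric tree overlaps over all
        tree pairs (t1, t2) with t1 in f1 and t2 in f2."""
        best = 0.0
        for t1, t2 in product(f1, f2):
            o12, o21 = overlap(t1, t2), overlap(t2, t1)
            if o12 + o21 > 0.0:  # skip pairs without any overlap
                best = max(best, 2.0 * o12 * o21 / (o12 + o21))
        return best

    def candidate_matches(forests, overlap, theta_o=0.7):
        """All forest pairs of a group whose similarity reaches the threshold."""
        matches = []
        for i, f1 in enumerate(forests):
            for f2 in forests[i + 1:]:
                sim = forest_similarity(f1, f2, overlap)
                if sim >= theta_o:
                    matches.append((f1, f2, sim))
        return matches

Sorting these candidates by decreasing similarity yields the ranked candidate set M that Step 3 consumes.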
Figure 5: Five types of alignment inferences that infer a closure a' from a new alignment a and an existing edge ex in the matching graph. (a) shows an equivalence closure for type(ex) = type(a) = equi. (b) and (c) show disjointness closures for type(ex), type(a) ∈ {equi, disj} and type(ex) ≠ type(a). (d) and (e) show parentship and childship closures for type(ex), type(a) ∈ {parent, child} and type(ex) = type(a).

Step 3. Next, HCM performs a logical reasoning to remove conflicting matches from the candidate set and thus enhance the precision. HCM draws conclusions for matches contradicting each other. This is done using the holistic knowledge of all ontologies. Given a ranked set of forest match candidates M, the semantic verification produces an alignment graph D = (G, E) with a set of WCF nodes G and a set of directed edges E ⊂ G × G representing the verified alignments. The edges are annotated with three different attributes:

• The certainty of an alignment cert : G × G → [0, 1] specifies the confidence of a matching. The higher the value, the more reliable the match.

• The alignment type classifies every matching pair into one of five categories type : G × G → {equi, disj, parent, child, onto}. An equi edge connects two concepts identified to be equal, whereas a disj edge marks two concepts as disjoint. A parent or child edge is a directed subsumption edge representing an rdfs:subClassOf relationship. An onto labeled edge marks two concepts as originating from the same source ontology.

• The alignment origin provides information about the edge's origin, origin : G × G → {def, detect, infer}. A def edge is an axiom stated in the source ontology. A detect alignment pair is a verified concept alignment pair, i.e., O(f1, f2) ≥ θO. An infer edge is an inference drawn from other matches with the help of transitive closures.
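These edge annotations can be captured in a small data structure. The sketch below is illustrative only; the paper does not prescribe an implementation, and the class names and the example URIs are assumptions.

    from dataclasses import dataclass
    from enum import Enum

    class AlignType(Enum):
        EQUI = "equi"
        DISJ = "disj"
        PARENT = "parent"
        CHILD = "child"
        ONTO = "onto"

    class Origin(Enum):
        DEF = "def"        # axiom stated in a source ontology
        DETECT = "detect"  # verified candidate with O(f1, f2) >= theta_O
        INFER = "infer"    # drawn from other matches via closures

    @dataclass(frozen=True)
    class Edge:
        f1: str            # source WCF / concept identifier
        f2: str            # target WCF / concept identifier
        cert: float        # certainty in [0, 1]
        type: AlignType
        origin: Origin

    # Hypothetical example edge; the concept URIs are placeholders.
    example = Edge("ex:ConceptA", "ex:ConceptB", 1.0, AlignType.EQUI, Origin.DEF)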
To populate the graph D, HCM performs the following:

(a) Initialize the alignment graph D with no edges (E = ∅).

(b) Populate D with ontology constraint edges a (cert(a) = 1.0 and origin(a) = def) derived from rdfs:subClassOf, owl:equivalentClass, and owl:disjointWith triples from the underlying RDF data.

(c) Add relations a between WCFs originating from the same ontology with type(a) = onto, cert(a) = 1.0, and origin(a) = def. Here, we use the ontology id from a concept URI by removing the ConceptID (see Sec. 2, Step 1).

(d) Finally, match candidates M (Step 2) are selected greedily: We add matches a = (f1, f2) from M with decreasing similarity O(f1, f2) to E. New edges in E comprise cert(a) = O(f1, f2), type(a) = equi, and origin(a) = detect as annotations.
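A compact sketch of this greedy population loop is given below. It is illustrative only: contradicts() stands in for the semantic verification described next, the edge representation is simplified to a dictionary, and none of the names come from the paper.

    def populate_alignment_graph(constraint_edges, candidates, contradicts):
        """Greedy population of the alignment graph D, following steps (a)-(d).

        constraint_edges: dicts with keys f1, f2, cert, type, origin
                          (rdfs:subClassOf / owl:* axioms and onto edges, origin 'def')
        candidates:       (f1, f2, sim) forest match candidates from Step 2
        contradicts:      callable(edge, edges) -> bool, the semantic verification
        """
        edges = []                                   # (a) start with E = empty set

        for e in constraint_edges:                   # (b) + (c) ontology constraints
            if not contradicts(e, edges):
                edges.append(e)

        # (d) candidate matches, greedily by decreasing forest similarity
        for f1, f2, sim in sorted(candidates, key=lambda c: c[2], reverse=True):
            edge = {"f1": f1, "f2": f2, "cert": sim,
                    "type": "equi", "origin": "detect"}
            if not contradicts(edge, edges):
                edges.append(edge)
        return edges

Adding the inverse edge a⁻¹ and the inferred closures, both described next, is omitted here for brevity.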
When inserting an alignment (an ontology constraint or a candidate alignment from Step 2), we perform three steps:

First, we check whether an alignment a contradicts the existing edges in D. To this end, HCM checks for four different types of contradictions: Conflicting definitions are alignments in the graph that are incompatible with a new alignment a = (f1, f2). A conflicting definition occurs if there is an edge ex = (f1, f2), ex ∈ E, such that type(ex) ≠ type(a), origin(ex) ≠ infer, or cert(ex) ≥ cert(a). Furthermore, HCM checks alignments for multiple-entity correspondences, crisscross correspondences, and disjointness-subsumption contradictions. These three types of contradictions were originally introduced for the ASMOV ontology alignment approach by Jean-Mary et al. [15].

If and only if no contradiction was found for the candidate alignment a, HCM adds the alignment candidate and its inverse a⁻¹ to the graph's edge set E. The inverse alignment a⁻¹ = (f2, f1) of a = (f1, f2) has the same origin (origin(a⁻¹) = origin(a)) and certainty (cert(a⁻¹) = cert(a)) and the inverse adapted type (type(a⁻¹) = type⁻¹(a)):

    type⁻¹(x) =  equi    if type(x) = equi
                 disj    if type(x) = disj
                 parent  if type(x) = child
                 child   if type(x) = parent
                 onto    if type(x) = onto

Second, HCM infers further alignments from a (respectively a⁻¹) by drawing conclusions with the help of existing alignments e ∈ E. HCM supports the five types of inference shown in Figure 5: Given an existing alignment e = (f1, f2), e ∈ E, and the just added alignment a = (f2, f3) (respectively a⁻¹ = (f2, f3)), a closure a' = (f1, f3) can be drawn.
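The inverse type and the five closure rules of Figure 5 can be written down directly. The sketch below encodes only what the figure caption states; function and variable names are illustrative.

    # Inverse type for a^-1, exactly as defined above.
    INVERSE_TYPE = {"equi": "equi", "disj": "disj",
                    "parent": "child", "child": "parent", "onto": "onto"}

    # Closure rules from Figure 5: existing edge e = (f1, f2) plus the newly
    # added alignment a = (f2, f3) yield a closure a' = (f1, f3).
    # (a) equi + equi -> equi; (b), (c) mixed equi/disj -> disj;
    # (d), (e) parent + parent -> parent and child + child -> child.
    CLOSURE_TYPE = {
        ("equi", "equi"): "equi",
        ("equi", "disj"): "disj",
        ("disj", "equi"): "disj",
        ("parent", "parent"): "parent",
        ("child", "child"): "child",
    }

    def closure(existing, new):
        """Return the inferred edge (f1, f3, type, 'infer') or None."""
        (f1, mid1, t_e), (mid2, f3, t_a) = existing, new
        if mid1 != mid2:
            return None
        t_closure = CLOSURE_TYPE.get((t_e, t_a))
        return (f1, f3, t_closure, "infer") if t_closure else None

The paper does not state how the certainty of an inferred edge is computed, so that part is left open here.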
Third, given the verified alignment graphs, HCM outputs all edges e ∈ E with origin(e) = detect as the alignment set A.

For instance, given a WCF group G = {dbpedia:SpaceStation, umbel:SpacePlatform_Manned, dbpedia:Spacecraft}, the population of graph D works as follows: The only ontology constraint for G is the alignment aonto = (dbpedia:Spacecraft, dbpedia:SpaceStation), with cert(aonto) = 1, origin(aonto) = def, and type(aonto) = onto, indicating that both concepts originate from the same ontology. Both aonto and its inverse aonto⁻¹ are added to D. Afterwards, the candidate set M = {a1, a2, a3} is processed, assuming alignment certainties of
a1 = (dbpedia:SpaceStation, umbel:SpacePlatform_Manned), cert(a1) = 0.955,
a2 = (dbpedia:Spacecraft, dbpedia:SpaceStation), cert(a2) = 0.718, and
a3 = (dbpedia:Spacecraft, umbel:SpacePlatform_Manned), cert(a3) = 0.341.
HCM proceeds as follows: First, a1 is added to D with origin(a1) = detect and type(a1) = equi (and so a1⁻¹).
Afterwards, a2 is not added, due to the conflicting definition aonto, which marks both concepts as originating from the same ontology. Finally, due to a multiple-entity correspondence, a3 is prevented too: umbel:SpacePlatform_Manned is already matched to a concept from the DBpedia ontology.
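Continuing the example, the following self-contained snippet replays these decisions: the onto constraint blocks a2 and the multiple-entity-correspondence check blocks a3. Both check functions are simplified stand-ins for HCM's full verification and are not taken from the paper.

    # Candidate alignments of the example group, ordered by decreasing certainty.
    candidates = [
        ("dbpedia:SpaceStation", "umbel:SpacePlatform_Manned", 0.955),
        ("dbpedia:Spacecraft",   "dbpedia:SpaceStation",       0.718),
        ("dbpedia:Spacecraft",   "umbel:SpacePlatform_Manned", 0.341),
    ]

    def same_ontology(c1, c2):
        """Simplified onto constraint: compare the ontology prefix of the URI."""
        return c1.split(":")[0] == c2.split(":")[0]

    accepted = []
    for f1, f2, cert in candidates:
        matched = {c for a, b, _ in accepted for c in (a, b)}
        if same_ontology(f1, f2):
            print(f"reject {f1} / {f2}: conflicting onto definition")
        elif f1 in matched or f2 in matched:
            print(f"reject {f1} / {f2}: multiple-entity correspondence")
        else:
            accepted.append((f1, f2, cert))
            print(f"accept {f1} / {f2} with cert {cert}")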
4.2    Experiments
In the following section, we discuss the experimental results of HCM for the BTC data as well as on the Ontology Alignment Evaluation Initiative benchmark track [6, 7].

BTC. To determine HCM's alignment precision for the BTC data, we executed the alignment generation algorithm using different threshold settings (θO). We chose seven ranges: sim_forest = 1, 1 > sim_forest ≥ 0.95, 0.95 > sim_forest ≥ 0.9, ..., 0.75 > sim_forest ≥ 0.7. We did not consider alignments with O(f1, f2) < 0.7, which is in line with the argument by Jain et al. for BLOOMS+'s tree overlap measure [14]. The authors mention an optimal threshold of 0.85. After determining alignments using HCM, a random sample of 50 alignments for each similarity range was selected. Then three independent annotators manually inspected these alignments and identified the actual relation between aligned concepts in a majority vote process. We used three different types of relations: Equivalence indicates a correct alignment of two equivalent concepts. Similar indicates a fuzzy match: the two concepts are related and the instance sets of the concepts have a significant intersection. Disjoint indicates an incorrect alignment of two concepts; the instance sets of the concepts are disjoint.

Table 2 shows our manual evaluation results as well as their confidence intervals (confidence level of 95%). Additionally, it depicts the absolute numbers of alignments found in the BTC data for varying thresholds. In the case of 1 > sim_forest ≥ 0.95, 57.5% of the alignments were labeled as equivalences and 15% as similar. The number of actual equivalences decreases for smaller forest similarities. When considering only the strict equivalence judgments as relevant alignments, the inter-annotator agreement is very high, κ = 0.909 (Cohen's kappa coefficient). By treating fuzzier results (similar or equal alignments) as correct, the inter-annotator agreement decreases to κ = 0.817. This is due to a varying perception of similarity or relatedness, which is much more subjective than a strict equivalence definition. For instance, the concepts daml:Ammeters and umbel:Voltmeter are related due to their common purpose. On the other hand, from a taxonomic point of view, the instance sets of both concepts are probably disjoint. HCM discovered interesting matches such as:

• yago:PsychoactiveFungi and umbel:HallucinogenicMushroom
• dbpedia:BirdsOfAustralia and opencyc:Parrot
• umbel:AirConditioner and dbpedia:HeatExchangers

Note that, due to the diverse and non-trivial domains in the BTC corpus, these kinds of matches are hard to identify, even for humans. Our algorithm can deal with these cases because it exploits the broad knowledge for many domains present in Wikipedia.

Furthermore, it is interesting to see that even in the case of false positives, the algorithm reveals interesting semantic relations between concepts. For instance, HCM aligns the concepts yago:Outlaws and umbel:MotorcycleClub due to a semantic relation via outlaw motorcycle clubs similar to the rivals Hells Angels and Bandidos.

[Figure 6: scatter plot of the alignment generation runtime in hours (y-axis, 0-36) over the number of forests per group (x-axis, 0-4,000), with a regression line.]
Figure 6: Alignment generation execution time for different WCF group sizes. The blue dotted line indicates the quadratic regression line unifying the actual alignment generation time per group (indicated by a yellow rhomb). The experiment was executed with h = 4, d = 10, the ConceptID TFIDF3 keyword extraction method, the TFIDFForest10 topic extractor (θJ = 0.7), and the forest threshold θO = 0.7.

Runtimes of the alignment generation phase are visualized in Figure 6 for different WCF groups (orange rhomb markers). Due to the pairwise comparison of the trees of all WCFs in a group G to generate candidate alignments (see Step 2), the alignment generation phase has a quadratic runtime in the group size |G|. Furthermore, the effort of the alignment verification is polynomial in the number of alignment candidates. The blue dotted line indicates the polynomial regression of the forest match execution times. The number of small groups is very high (see Figure 3), which is not a problem because of their small runtimes. A remaining issue is the large effort of the matching step for large groups, which is left for future work.

    O(f1, f2)         P(eq)           P(eq ∪ sim)     # alignments
    O = 1.0           0.964 ±0.036    0.964 ±0.036        601
    1.0 > O ≥ 0.95    0.575 ±0.143    0.725 ±0.129    133,317
    0.95 > O ≥ 0.9    0.481 ±0.145    0.706 ±0.131     72,131
    0.9 > O ≥ 0.85    0.071 ±0.066    0.294 ±0.131      6,921
    0.85 > O ≥ 0.8    0.053 ±0.053    0.200 ±0.114      4,279
    0.8 > O ≥ 0.75    0.053 ±0.053    0.126 ±0.092      3,139
    0.75 > O ≥ 0.7    0.071 ±0.066    0.331 ±0.136      1,950

Table 2: The probabilities and confidence intervals for the 50 random samples of the alignments proposed by HCM (confidence level 95%). The leftmost column shows the different similarity classes (O(f1, f2)). The second column presents the probability and confidence interval for an alignment representing an equivalence. The third column shows the probability and confidence interval for an alignment representing either equivalence or similarity. The rightmost column depicts the total number of identified alignments within the BTC data.
OAEI. In order to compare with other ontology alignment approaches (e.g., [2, 13, 15, 17]), we provide results on the benchmark track of the Ontology Alignment Evaluation Initiative campaign [6]. The goal of the track is to evaluate strengths and weaknesses of different alignment approaches by providing a set of alignments between a reference ontology and systematically altered ontologies. All considered ontologies in the benchmark track originate from the bibliographic domain. Since our algorithm is designed to match concepts from various ontologies at once, we generate alignments for concepts from ontologies in the benchmark track. To determine the matching quality, we extract all concept pairs between ontologies having a reference alignment. HCM treats concepts with equivalent URIs as one entity; therefore, we implicitly infer alignments between concepts with equivalent URIs. Furthermore, the reference alignments for properties are neglected, because HCM does not target property alignments. Using a WCF overlap threshold of θO = 0.7 leads to a precision of 85% and a recall of 55%. In comparison to other approaches, which additionally offered alignments between properties, these numbers are average. This is mainly due to the changes of ConceptIDs and descriptions in the automatically generated ontologies. For instance, the concept xsqlknk of ontology 266 contains neither a human-readable ConceptID nor a description. Nevertheless, the reference alignments contain a match with 101:MastersThesis. Many state-of-the-art ontology matching algorithms solve this problem by using a structure-based similarity as the sole basis of decision making. However, for the LOD ontology matching task, this approach is unpromising, because the different scopes of the authors of LOD data sets lead to large differences in the ontology structure. For instance, the two data sets DBpedia12 and YAGO13 both represent knowledge extracted from Wikipedia. Nevertheless, the ontologies of both data sets vary significantly in structure, whereas many common concepts are shared.

12 http://dbpedia.org/
13 http://www.mpi-inf.mpg.de/yago-naga/yago/

In 2011, a new bibliographic benchmark data set was generated [7]; the results of HCM are equivalent for the new data set.

5.   CONCLUSION AND FUTURE WORK
In this paper, we tackle the problem of large-scale concept matching, one of the major challenges in ontology matching [19]. In contrast to other approaches that process pairs of ontologies, our focus is on aligning many ontologies from the LOD cloud simultaneously in a holistic manner. To this end, we propose an abstract workflow (knowledge extraction, grouping, and alignment generation) to specifically enable scalability while still examining all ontologies in their entirety. Further, we plug state-of-the-art techniques and novel ideas into this workflow and report on promising results for scalability and alignment quality in a web-scale alignment scenario. For representing knowledge, we chose Wikipedia Category Forests. For grouping the input, we leverage topical information. Last, the alignment generation leverages Wikipedia Category Forest overlaps and performs a semantic inspection.

We have many ideas for future directions: For instance, we will look into other knowledge representations and matching techniques that can be plugged into the workflow. The challenge here is to identify approaches that can capture different semantic notions of an input concept and allow a useful grouping. Also, we plan to derive further relationship types, possibly with the help of instance data instead of Wikipedia categories. Last, we plan to experiment with other topical groupings that, e.g., allow overlaps across groups. Another grouping method could also support incremental updates and thus facilitate online ontology alignments for the growing web of linked data.

References
[1] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems, 4(2):1-22, 2009.
[2] I. F. Cruz, F. P. Antonelli, and C. Stroe. AgreementMaker: Efficient matching for large real-world schemas and ontologies. In Proceedings of the VLDB Endowment, pages 1586-1589, 2009.
[3] J. David, F. Guillet, and H. Briand. Matching directories and OWL ontologies with AROMA. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 830-831, 2006.
[4] H. H. Do and E. Rahm. COMA - a system for flexible combination of schema matching approaches. In VLDB, pages 610-621, 2002.
[5] K. Eckert, C. Meilicke, and H. Stuckenschmidt. Improving ontology matching using meta-level learning. In ESWC, pages 158-172, 2009.
[6] J. Euzenat, A. Ferrara, C. Meilicke, J. Pane, F. Scharffe, P. Shvaiko, H. Stuckenschmidt, O. Šváb Zamazal, V. Svátek, and C. Trojahn. Results of the Ontology Alignment Evaluation Initiative 2010. In Proceedings of the International Workshop on Ontology Matching at ISWC, 2010.
[7] J. Euzenat, A. Ferrara, W. R. van Hage, L. Hollink, C. Meilicke, A. Nikolov, F. Scharffe, P. Shvaiko, H. Stuckenschmidt, O. Šváb Zamazal, and C. Trojahn. Final results of the Ontology Alignment Evaluation Initiative 2011. In Proceedings of the International Workshop on Ontology Matching at ISWC, 2011.
[8] J. Euzenat and P. Shvaiko. Ontology Matching. Springer-Verlag, 2007.
[9] B. He and K. C.-C. Chang. Statistical Schema Matching across Web Query Interfaces. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), pages 217-228, 2003.
[10] B. He, K. C.-C. Chang, and J. Han. Discovering complex matchings across web query interfaces: a correlation mining approach. In Proceedings of the Conference on Knowledge Discovery and Data Mining, pages 148-157, 2004.
[11] J. Hopcroft and R. Tarjan. Algorithm 447: efficient algorithms for graph manipulation. Communications of the ACM, 16:372-378, 1973.
[12] J. Huber, T. Sztyler, J. Nößner, and C. Meilicke. CODI: Combinatorial optimization for data integration: results for OAEI 2011. In Proceedings of the International Workshop on Ontology Matching (OM), 2011.
[13] P. Jain, P. Hitzler, A. P. Sheth, K. Verma, and P. Z. Yeh. Ontology Alignment for Linked Open Data. In Proceedings of the International Semantic Web Conference (ISWC), 2010.
[14] P. Jain, P. Z. Yeh, K. Verma, R. G. Vasquez, M. Damova, P. Hitzler, and A. P. Sheth. Contextual Ontology Alignment of LOD with an Upper Ontology: A Case Study with Proton. In Proceedings of the Extended Semantic Web Conference (ESWC), 2010.
[15] Y. R. Jean-Mary, E. P. Shironoshita, and M. R. Kabuka. Ontology matching with semantic verification. Web Semantics: Science, Services and Agents on the World Wide Web, 7:235-251, 2009.
[16] Y. Lee, M. Sayyadian, A. Doan, and A. S. Rosenthal. eTuner: tuning schema matching software using synthetic scenarios. The VLDB Journal, 16:97-122, January 2007.
[17] J. Li, J. Tang, Y. Li, and Q. Luo. RiMOM: A dynamic multistrategy ontology alignment framework. IEEE Transactions on Knowledge and Data Engineering (TKDE), 21:1218-1232, 2008.
[18] A. Nikolov, V. Uren, E. Motta, and A. de Roeck. Overcoming Schema Heterogeneity between Linked Semantic Repositories to Improve Coreference Resolution. In Proceedings of the Asian Conference on The Semantic Web, pages 332-346. Springer Berlin / Heidelberg, 2009.
[19] P. Shvaiko and J. Euzenat. Ontology Matching: State of the Art and Future Challenges. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2012, to appear.
[20] E. Rahm. Towards Large-Scale Schema and Ontology Matching. In Schema Matching and Mapping, chapter 1, pages 3-27. Springer Berlin / Heidelberg, 2011.
[21] L. A. Ribeiro and T. Härder. Efficient Set Similarity Joins Using Min-prefixes. In Advances in Databases and Information Systems, pages 88-102. Springer Berlin / Heidelberg, 2009.
[22] D. Ritze and H. Paulheim. Towards an Automatic Parameterization of Ontology Matching Tools based on Example Mappings. In P. Shvaiko, J. Euzenat, T. Heath, C. Quix, M. Mao, and I. Cruz, editors, Proceedings of the 6th International Workshop on Ontology Matching, volume 814, pages 37-48, http://ceur-ws.org, October 2011. CEUR Workshop Proceedings.
[23] K. Saleem, Z. Bellahsene, and E. Hunt. PORSCHE: Performance ORiented SCHEma mediation. Information Systems, 33:637-657, 2008.
[24] W. Su, J. Wang, and F. Lochovsky. Holistic Schema Matching for Web Query Interfaces. In Proceedings of the International Conference on Extending Database Technology (EDBT), pages 77-94, 2006.
[25] F. M. Suchanek, S. Abiteboul, and P. Senellart. PARIS: probabilistic alignment of relations, instances, and schema. Proceedings of the VLDB Endowment, 5:157-168, 2011.
[26] C. Wang, J. Wang, X. Lin, W. Wang, H. Wang, H. Li, W. Tian, J. Xu, and R. Li. MapDupReducer: detecting near duplicates over massive datasets. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), pages 1119-1122, 2010.
[27] C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In Proceedings of the International World Wide Web Conference (WWW), pages 131-140, 2008.