=Paper=
{{Paper
|id=None
|storemode=property
|title=Holistic and Scalable Ontology Alignment for Linked Open Data
|pdfUrl=https://ceur-ws.org/Vol-937/ldow2012-paper-08.pdf
|volume=Vol-937
|dblpUrl=https://dblp.org/rec/conf/www/GrutzeBN12
}}
==Holistic and Scalable Ontology Alignment for Linked Open Data==
Holistic and Scalable Ontology Alignment for Linked Open Data
Toni Gruetze, Christoph Böhm, Felix Naumann
Hasso Plattner Institute, Potsdam, Germany
toni.gruetze@hpi.uni-potsdam.de, christoph.boehm@hpi.uni-potsdam.de, felix.naumann@hpi.uni-potsdam.de
Copyright is held by the author/owner(s). LDOW2012, April 16, 2012, Lyon, France.

ABSTRACT
The Linked Open Data community continuously releases massive amounts of RDF data that shall be used to easily create applications that incorporate data from different sources. Inter-operability across different sources requires links at instance- and at schema-level, thus connecting entities on the one hand and relating concepts on the other hand. State-of-the-art entity- and ontology-alignment methods produce high-quality alignments for two "nicely structured" individual sources, where an identification of relevant and meaningful pairs of ontologies is a precondition. Thus, these methods cannot deal with heterogeneous data from many sources simultaneously, e.g., data from a linked open data web crawl. To this end we propose Holistic Concept Matching (HCM). HCM aligns thousands of concepts from hundreds of ontologies (from many sources) simultaneously, while maintaining scalability and leveraging the global view on the entire data cloud. We evaluated our approach against the OAEI ontology alignment benchmark as well as on the 2011 Billion Triple Challenge data and present high-precision results created in a scalable manner.

1. INTRODUCTION
In 2006 Berners-Lee proposed the Linked Open Data (LOD) design principles (http://www.w3.org/DesignIssues/LinkedData.html). These principles outline the vision of a global data cloud that can be used to build novel applications incorporating information about real-world entities from many sources. He suggests dereferencable HTTP URIs for naming things that return useful information when looked up on the web. Further, he encourages links to other URIs so that one can discover more information about things under consideration. The semantic web community adopted these suggestions and we are now witnessing the growth of a giant global data cloud comprising information from many sources using de-facto standards such as RDF, RDFS, OWL, etc. The pure availability of this data following a set of principles is a big win since, in general, the use of (non-linked) open data requires source-specific approaches to access, query, and filter the data of interest. Instead, when published as LOD, one can leverage different sources through common mechanisms such as HTTP and SPARQL. Examples of such LOD applications are showcased on data.gov and data.gov.uk.

However, the LOD vision includes the connection of data from different sources via links to facilitate an easy integration. As of September 2011, the LOD cloud comprised 295 sources, which all fulfill the basic LOD principles (http://www4.wiwiss.fu-berlin.de/lodcloud/state/). In contrast, the number and quality of links across these sources are an ongoing issue since their discovery is a major challenge [1]. In this paper, we focus on schema-level links, i.e., ontology alignments across data sources; of the 295 data sources, 190 use proprietary vocabulary terms. Of these 190 sources, only 15 offer mappings to other widely deployed vocabularies, but 159 provide dereferencable URIs for proprietary terms, i.e., descriptive information for these "new terms" is available. Thus, the evolving web of vocabulary terms requires running ontology alignment approaches on (all) pairs of sources without any (or with only few) mappings to other vocabularies.

State-of-the-art ontology matching has been designed to cope with nicely structured and well-defined ontologies in order to produce high-quality mappings for one pair of sources from one specific domain at a time. Instead, in the case of data that stems from the web, we need approaches that can (simultaneously) deal with heterogeneous and incomplete vocabulary definitions from many different sources dealing with various topics. Further, the vocabularies from the LOD cloud as a whole allow a holistic view on the web of vocabulary terms and thus the creation of alignments that depend on other alignments and dependencies. Resulting alignment information across many sources can be used for web query answering or the discovery of sources with respect to specific topics. However, it is a major scalability challenge to deal with very many vocabulary terms gathered from the linked data web. Therefore, we tackle one of the major challenges in ontology matching, namely matching at a very large scale [19]. We further add the requirement to be applicable to real-world web data from various origins instead of two specific sources.

In the following section we briefly review state-of-the-art techniques for aligning ontologies and then derive requirements for an approach that deals with heterogeneous vocabulary definitions from the web of data (Sec. 1.1). Next, we outline our general approach for aligning multiple LOD vocabularies (Sec. 1.2), followed by a technical description of the respective alignment phases (Sec. 2–4). Along with the technical details of our approach we present measurements to convince the reader of the respective phase's scalability.

1.1 State-of-the-art
The matching of data models is an essential task for various areas in information science, e.g., information integration, peer-to-peer data management, Web service composition, etc. [8]. Thus, a wide range of approaches has been published in the area of schema and ontology matching. Current ontology matching approaches often enter the annual Ontology Alignment Evaluation Initiative (OAEI) [7]. The OAEI aims at comparing the performance of different systems by assessing respective strengths and weaknesses. Successful participants are, for instance, [2–4, 12, 15, 17]. Many state-of-the-art approaches are based on the combination of different basic matching techniques and require a parameter optimization for each matching task. This configuration is addressed by different meta matching systems [5, 16, 22].
However, most of the state-of-the-art approaches have not been run on large and heterogeneous ontologies that stem from the LOD cloud.

Nevertheless, the number of ontology matching approaches dedicated to ontologies from the LOD domain has grown recently. Those LOD approaches commonly use only one (or few) basic matching techniques. Often, these techniques utilize special RDF and LOD characteristics to improve the matching result. For instance, Nikolov et al. introduce an approach that utilizes owl:sameAs links between data sets in the LOD cloud to derive concept overlaps [18]. However, there are hundreds of ontologies in the Web of Data, which have been created for different use cases but still contain many overlaps worth discovering. Therefore, the ability to perform cross-domain matching is especially important. Jain et al. introduced an approach that utilizes Wikipedia categories for bootstrapping; this way, they can cover knowledge from various domains [13, 14]. Furthermore, the LOD ontologies are commonly very large and contain many instances. Suchanek et al. use instance knowledge to match instance- and schema-level entities [25].

Recently, Rahm surveyed different approaches to large-scale matching tasks [20]. Besides different basic techniques to reduce runtime and manual configuration effort, he explicitly emphasized holistic schema matching. Approaches of this type process a set of input schemata in one run (like [23]) and infer knowledge from the entirety of schemata [9, 10, 24]. In this paper we adopt this notion for ontology alignment on a large scale.

The input for web-scale ontology alignments is a set C of concepts that has been gathered from the LOD cloud via web access methods. The concepts in C stem from many different ontologies. For each concept in C, there is at least a URI – its id. Preferably, there is further information, such as a label, a description, or a comment. Additionally, there can be further meta data like structural information. For the processing of these large amounts of heterogeneous ontological information we suggest the following properties for an alignment approach:
• The approach must work with ontologies from diverse domains, i.e., the underlying concept matching strategy must be applicable to many domains.
• Due to a large input concept set, the approach must perform automatic alignment in sub-quadratic runtime (in the number of concepts).
• The approach should process concept definitions only. That is, it can neglect instance data due to its immense size. Further, property definitions can be ignored since we found that property definitions and their actual usage differ largely.
• Ontology structures should not be used for the actual alignment creation (if available at all). This is for scalability reasons and since the structure across diverse ontologies varies largely and is thus not beneficial. However, the approach can exploit structural information, such as subclass relationships, to verify derived alignments and find semantic contradictions.
• The approach must return equivalence relationships for vocabulary terms. These relationships can be fuzzy (depending on some parameter) since different ontologies have different granularities, e.g., umbel:SpacePlatform_Manned ∼ dbpedia:SpaceStation.
Please note that others might make other design decisions; however, due to the immense scale and a questionable quality of current general web data we follow these properties.

1.2 Alignment approach overview
The aforementioned properties shall hold for an approach that can be applied to web-scale input data. However, at the same time we target a holistic view on the data to leverage the information at hand as a whole. Therefore, we propose the Holistic Concept Matching approach (HCM). To address the scalability challenge we group the input data by topic and thus create small groups of concepts that can be aligned locally. Since we group by topics, we can still infer relationships holistically, i.e., draw conclusions not only based on a pairwise similarity but additionally based on alignments and dependencies among other topically related members of the same group.

Consider for instance five sources (A, B, C, D, E) and let {a1, . . . , am, b1, . . . , bn, c1, . . . , co, d1, . . . , dp, e1, . . . , eq} be the concepts from these sources. Running a traditional approach on all available pairs of ontologies would result in up to (5 choose 2) = 10 isolated runs, which is computationally infeasible given that alignment approaches are often based on complex lexical and structural properties [2, 15, 17]. Also, isolated runs may yield low-quality results, since vocabulary definitions from the web can be incomplete and highly heterogeneous in terms of granularity and structure of the ontology.

With the help of additional topic knowledge for the web sources and their entities we are able to group them. For instance, given that the three sources (A, B, C) store media-related entities and (D, E) provide information from the life sciences, we examine groups of concepts, e.g., {a1, . . . , am, b1, . . . , bn, c1, . . . , co} and {d1, . . . , dp, e1, . . . , eq} – yielding fewer runs of the approach. Note that these groups must not directly relate to the input ontologies: For instance, if d1 = human, then it could also reside in the first group. However, within these groups, we can then run computationally more complex approaches to take a holistic view based on multiple ontologies.

Specifically, our general approach is shown in Fig. 1. Given the data from the LOD cloud, we extract a knowledge representation for all available concepts. This representation can be a simple descriptive string, a feature vector, or a more complex data structure. Next, we apply topical grouping in order to create smaller sets of concepts. Within these sets, we then create alignments by identifying similar knowledge representations and reasoning among them using additional structural information from the input as well as further candidates from the same group.

Figure 1: Scalable and holistic LOD ontology alignment. The workflow leads from the LOD cloud via knowledge extraction to a knowledge representation per concept, via grouping to candidate groups, and via alignment generation to consistent alignments.

Obviously, many techniques can be plugged into this approach. Depending on the knowledge representation, one can choose the specific representation of a concept's topic, specific topic similarities as well as specific concept similarities in order to find alignments. In the remainder of this paper, we report on the adoption of the Wikipedia category forest [13] for HCM (Sec. 2). The topical grouping is done using a set similarity index (Sec. 3). For the alignment generation we combine the Wikipedia category forest similarity [14] and a rule-based verification approach [15] (Sec. 4).
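The paper later states that all described methods are implemented in Java. Purely as an illustration of this three-phase workflow (and not as the authors' code), the phases of Fig. 1 can be pictured as three pluggable interfaces; every name below is hypothetical:

<pre>
import java.util.Collection;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the HCM workflow of Fig. 1:
// LOD cloud -> knowledge representation -> candidate groups -> consistent alignments.
interface KnowledgeExtractor<R> {
    R represent(String conceptUri);            // phase 1 (Sec. 2): e.g., build a WCF per concept
}

interface TopicGrouper<R> {
    List<Set<R>> group(Collection<R> reps);    // phase 2 (Sec. 3): topical blocking of the input
}

interface AlignmentGenerator<R, A> {
    Set<A> align(Set<R> candidateGroup);       // phase 3 (Sec. 4): matching plus semantic verification
}
</pre>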
2. KNOWLEDGE REPRESENTATION
The comparison of different concepts requires an abstract representation of their semantic content, i.e., a knowledge representation. First we present our method to compute this representation, then we show our performance results.

2.1 Wikipedia Category Forests
Given a set C of concepts from many different ontologies gathered from the LOD cloud, we now elaborate on the data structure we use to represent a concept. To this end, we chose Wikipedia Category Forests (WCF) as proposed for the BLOOMS algorithm in [13, 14], as BLOOMS is a state-of-the-art alignment algorithm for matching heterogeneous ontologies from many domains. A WCF is a set of Wikipedia category trees created as follows:
1. Given a single concept c ∈ C, create a list of keywords kw(c) = {k1, . . . , kn} from c.
2. Given kw(c) = {k1, . . . , kn}, a Wikipedia search for all keywords returns a ranked list of Wikipedia pages R = {p1, . . . , pm}. These pages are the roots of the trees in the resulting forest.
3. For each page p of the d top-ranked pages {p1, . . . , pd} in R and a height parameter h, construct a tree in the following recursive manner:
(a) In recursion 1 ≤ h, determine p's categories {α1, . . . , αq}.
(b) In recursion 2 ≤ h, determine the α's super-categories {β1, . . . , βr}.
(c) In recursion 3 ≤ h, determine the β's super-categories {γ1, . . . , γs}.
(d) etc.
We now explain the three steps in more detail.

Step 1. To determine keywords kw(c) for a concept c we have tested several methods. In the following, we use two orthogonal base methods, namely the TFIDFn-extractor and the ConceptID-extractor. The TFIDFn-extractor processes descriptions of concepts, i.e., comments and labels. For this, we merge the description texts, tokenize them, and neglect stop words (we use the English stop word lists of Apache Lucene (v2.3.0) and of the MediaWiki plugin LuceneSearch (v2.1.3) with 33 respectively 121 words). Then, we determine the tf-idf score for each token with respect to the overall corpus of concept descriptions. Finally, we select the top-n tf-idf-ranked tokens as keywords kw(c).

The ConceptID-extractor is solely based on the concept's URI. The concept id is the unique identifier of a concept in its ontology definition. HCM uses a straightforward approach to determine the concept id: We truncate the prefix of the URI until the last occurrence of either #, :, or /. The concept id must not contain any character other than letters or underscores. To extract tokens, we split the concept id at camel-case characters and underscores. After again removing stop words, this results in the set of keywords kw(c). For instance, the concept id for http://umbel.org/umbel/rc/SpacePlatform_Manned would be SpacePlatform_Manned. From this we create kw(c) = {space, platform, manned}.

Additionally, HCM supports the ConceptID_TFIDFn-extractor that combines both methods. First, ConceptID keywords are extracted. If no WCF can be constructed (see next steps), HCM tries to build a forest using TFIDFn keywords instead. Thus, the ConceptID_TFIDFn-extractor maximizes the number of concepts that can be represented by a WCF.
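The ConceptID-extractor lends itself to a compact illustration. The following is a minimal Java sketch of the rules stated above (truncate the URI at the last #, :, or /; accept only letters and underscores; split at camel-case boundaries and underscores; drop stop words). It is not the authors' implementation; the class name is hypothetical and the stop word list is only a tiny stand-in for the Lucene/LuceneSearch lists used by HCM.

<pre>
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class ConceptIdKeywords {
    // Illustrative stand-in for the real stop word lists.
    private static final Set<String> STOP_WORDS =
            new LinkedHashSet<>(Arrays.asList("the", "of", "and", "a"));

    public static Set<String> keywords(String conceptUri) {
        // Truncate the URI prefix up to the last occurrence of '#', ':', or '/'.
        int cut = Math.max(conceptUri.lastIndexOf('#'),
                  Math.max(conceptUri.lastIndexOf(':'), conceptUri.lastIndexOf('/')));
        String conceptId = conceptUri.substring(cut + 1);
        // The concept id must only contain letters or underscores.
        if (!conceptId.matches("[A-Za-z_]+")) {
            return new LinkedHashSet<>();
        }
        // Split at underscores and camel-case boundaries, then drop stop words.
        Set<String> result = new LinkedHashSet<>();
        for (String token : conceptId.split("_|(?<=[a-z])(?=[A-Z])")) {
            String t = token.toLowerCase(Locale.ROOT);
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Example from the text: .../SpacePlatform_Manned -> [space, platform, manned]
        System.out.println(keywords("http://umbel.org/umbel/rc/SpacePlatform_Manned"));
    }
}
</pre>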
Step 2. Next, HCM performs a Wikipedia full-text search using a search query that is built by concatenating all keywords kw(c) (delimited by spaces). From the resulting Wikipedia pages we choose the top d pages as tree roots for the creation of a WCF. The parameter d – the forest depth – influences a WCF's coverage of conceptual meanings. To catch all potential meanings of a concept, several Wikipedia articles and their category hierarchies have to be considered, i.e., it requires a deep forest. On the other hand, a deeper forest has a higher risk of containing trees not related to the actual concept. Therefore, the selection of the forest depth parameter is a trade-off between semantic coverage and irrelevance.

Step 3. The tree construction builds trees recursively from the root to the maximal tree height h. The height mainly influences the level of abstractness of a WCF. The higher the trees, the more category hierarchy layers are considered. This leads to more abstract categories in the WCF. Instead, the lower the trees, the less probable is a topical overlap among related concepts. We will discuss different parameter configurations in the experiments section.

For instance, given the resource umbel:MannedSpacecraft, the ConceptID-extractor yields {manned, spacecraft}, which results in the following Wikipedia search result (d = 3): {Human spaceflight, Spacecraft, Orion (spacecraft)}. Figure 2 depicts the tree for the root article Spacecraft (h = 2). As one can see, the higher the layer, the more nodes (Wikipedia categories) exist and the more abstract these categories are. Furthermore, we note that different tree nodes occur multiple times, e.g., "spaceflight" or "aerospace engineering". These nodes play an essential role in the next phase of HCM (Sec. 3).

Figure 2: Tree for the Wikipedia article "Spacecraft" with h = 2 (root node: spacecraft; 1st layer: astronautics, spacecraft, pressure vessels; 2nd layer: aerospace engineering, vehicles by media, space technology, pneumatics, structural engineering, gas technology, spaceflight, hydraulics, pressure containers). The 2nd-layer categories span two rows for readability.
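The recursive tree construction of Step 3 can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the lookup of (super-)categories against the local Wikipedia index is abstracted behind a hypothetical interface.

<pre>
import java.util.ArrayList;
import java.util.List;

public class CategoryTreeBuilder {

    // Hypothetical abstraction of the local Wikipedia full-text index.
    public interface CategoryIndex {
        List<String> categoriesOf(String pageOrCategory); // direct (super-)categories
    }

    public static class Node {
        final String title;
        final List<Node> children = new ArrayList<>();
        Node(String title) { this.title = title; }
    }

    // Build one category tree rooted at a Wikipedia article, expanding categories up to height h.
    public static Node buildTree(String rootArticle, int h, CategoryIndex index) {
        Node root = new Node(rootArticle);
        expand(root, 1, h, index);
        return root;
    }

    private static void expand(Node node, int level, int h, CategoryIndex index) {
        if (level > h) {
            return; // maximal tree height reached
        }
        for (String category : index.categoriesOf(node.title)) {
            Node child = new Node(category);
            node.children.add(child);
            expand(child, level + 1, h, index);
        }
    }
}
</pre>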
2.2 Experiments
Setup. All experiments were performed on a Windows 2008 R2 Enterprise Server with two quad-core Intel Xeon processors (2.66 GHz) and 30 GB of memory. To evaluate the performance of our approach in a web-scale scenario, we used the 2011 Billion Triple Challenge (BTC) data (http://km.aifb.kit.edu/projects/btc-2011). The BTC data is a crawl from the LOD cloud and consists of approximately 2.2 billion triples. We have implemented all described methods in Java.

The extraction of concept information can be done in linear time with Hdrs – a scalable distributed RDF store (http://code.google.com/p/hdrs) – that allows fast in-order index scans. HCM selects concepts by considering resources occurring in triples with different predicates: rdf:type, rdfs:subClassOf, rdfs:domain, rdfs:range, owl:equivalentClass, owl:disjointWith, rdfs:label, and rdfs:comment. In this manner HCM identified approx. 1M concepts in the BTC.

In contrast to the original BLOOMS algorithm [13], we cannot use the Wikipedia search service. This is because we need to issue very many search requests since we deal with thousands of concepts. This vast amount of queries would lead to a high network overhead when querying the search service. Therefore, for HCM, we load a Wikipedia dump into a full-text index (articles and category hierarchy); we used the English Wikipedia dump of August 3rd, 2011 with approx. 11 million pages (http://dumps.wikimedia.org/enwiki/20110803/). The index was created with Apache Lucene v2.3.0 (http://lucene.apache.org/java/docs/index.html) and a modified version of the LuceneSearch MediaWiki-plugin implementation v2.1.3 (http://www.mediawiki.org/wiki/Extension:Lucene-search). Additionally, we used MongoDB v1.6.5 (http://www.mongodb.org/) running on the same machine as the HCM implementation. HCM uses MongoDB to store intermediate results between the three phases (Sec. 2, 3, and 4).

Measurements. Given the BTC data, where we could extract 1M concepts, the overall keyword extraction time is 7 minutes (Step 1). The average number of keywords per set, i.e., per concept, is 2.77.

As expected, the Wikipedia index query time (Step 2) is linear in the number of extracted keyword sets. For instance, given 293k keyword sets from the ConceptID_TFIDF3-extractor, querying took 64 minutes.

The input to the keyword set extractors comprises 1M concepts. However, the ConceptID-extractor cannot yield a result for all input concepts, since we do not deal with blank nodes (approx. 50% of all input concept IDs are blank nodes) and malformed URIs. Further, not all keyword sets created yield a query result; in the end 238k keyword sets could be used to build WCFs. Nevertheless, when additionally applying the TFIDF3-extractor we can create a further 55k WCFs.

The forest construction runtime mainly depends on the forest depth d and the tree height h. Table 1 shows runtimes for the forest construction algorithm with different parameters. Apparently, the runtime depends linearly on the forest depth d. An increase of the tree height h leads to an exponential runtime increase.

Table 1: Forest construction runtime (in hours) using different tree height h and forest depth d parameters and the ConceptID_TFIDF10-extractor
h\d    5        10       15       20
1      0:06h    0:09h    0:11h    0:14h
2      0:15h    0:23h    0:28h    0:38h
3      0:31h    0:52h    1:18h    1:34h
4      1:13h    3:03h    3:55h    5:45h
5      4:14h    8:28h    14:01h   15:40h

Similar to Jain et al., we set tree height h = 4 and forest depth d = 10 for all following experiments [13]. For this configuration, the WCFs consist of 9.53 trees on average.

3. CANDIDATE GROUP CREATION
The previous step extracts WCFs, i.e., our knowledge representation of choice. Given these knowledge representations, we now need to find topically related concepts in order to group them. Note that in the following we use the terms domain and topic interchangeably. A domain or topic is a field of the real world where different concepts together play a role. The underlying intuition for the following is that a search for alignments among concepts within (and not across) topics should not cause a major loss of recall. Again, we first present our method, followed by an evaluation.

3.1 Topical groups
In our approach, the grouping (by topic) shall reduce the runtime complexity of the alignment generation step by splitting the problem space into smaller independent tasks (Sec. 4). This procedure is often referred to as blocking or partitioning.

Next, we discuss the details of finding topically related WCFs from the previous step. Given a set F of WCFs, we need to identify disjoint groups G1, . . . , Gn ⊂ F of topically related forests (topic groups). In HCM we implemented the following procedure:
1. For each WCF f ∈ F, extract topic(f) = {t1, . . . , tm} from f. We describe the topic of a WCF using a set of tree nodes extracted from the WCF.
2. Given a topic topic(f1), identify all forests f2 ∈ F with a high topical overlap (topic(f1) ≈ topic(f2)).
3. Given a set of pairs (f1, f2) with a high topical overlap, determine groups G of forests by associating topically overlapping forests transitively.
Step 1. The topic topic(f) of a WCF f is a subset of its tree nodes ({p, α1, . . . , αq, β1, . . . , βr, . . . }). In the process of finding characteristic nodes that represent this concept's topic, we ignore very specific categories like "Soviet manned space program" and very generic categories like "Humans" or "Wikipedia article lists". In this paper we report on a simple yet promising approach we have tested (among others) for the topic extraction: the TFIDFForestn-extractor again utilizes the tf-idf measure. Nodes common in a WCF are scored with a high term-frequency value (tf), whereas nodes popular in many WCFs are scored with a low inverse-document-frequency value (idf). We rank a WCF's tree nodes in descending tf-idf order and select the top n for representing the topic. The choice of n is crucial for the quality of the tree node set representing the concept's topic. Experiments indicated n = 10 to be a reasonable choice. Higher values lead to very specific topics, whereas smaller n lead to a low representativity.

Step 2. Next, we compare the topic sets to identify related forests. We use the Jaccard coefficient J and a corresponding threshold θJ to determine the similarity of topic sets (J(topic(f1), topic(f2)) ≥ θJ). The naïve pairwise comparison of all forest topic sets leads to a quadratic runtime behavior with respect to the number of forests |F|. To reduce this effort we tested different set-based similarity self-join techniques and selected ppjoin [27], which delivered the most promising results (see Sec. 3.2). The ppjoin is a set similarity self-join technique that reduces the number of object pairs to be compared by using an inverted index and two filtering techniques on the candidate sets (prefix and positional filtering). Our implementation is an adapted version of an open source implementation (https://code.google.com/p/similarity-join-tools/). The application of a set-based similarity self-join technique enhances performance and still keeps the topical cohesion of the groups.

Step 3. To determine groups of topically related WCFs, we build a graph containing the forest nodes F and edges between topically related forests (J(topic(f1), topic(f2)) ≥ θJ). Next, we use a depth-first search to determine connected components within the graph [11]. Then, all forests in a component are either directly or indirectly connected to each other, but have no connection to forests in other components.
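Steps 2 and 3 can be illustrated with the following Java sketch. For readability it uses the naïve quadratic pairwise comparison that the text warns about; in HCM this comparison is replaced by the ppjoin set similarity self-join [27]. Class and method names are hypothetical.

<pre>
import java.util.*;

public class TopicGrouping {

    // Jaccard coefficient of two topic sets.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // topics: one topic set per WCF; returns groups of WCF indices (connected components).
    static List<List<Integer>> group(List<Set<String>> topics, double thetaJ) {
        int n = topics.size();
        List<List<Integer>> adjacency = new ArrayList<>();
        for (int i = 0; i < n; i++) adjacency.add(new ArrayList<>());
        // Naive pairwise comparison (quadratic); HCM uses ppjoin here instead.
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (jaccard(topics.get(i), topics.get(j)) >= thetaJ) {
                    adjacency.get(i).add(j);
                    adjacency.get(j).add(i);
                }
            }
        }
        // Depth-first search for connected components [11].
        boolean[] visited = new boolean[n];
        List<List<Integer>> groups = new ArrayList<>();
        for (int start = 0; start < n; start++) {
            if (visited[start]) continue;
            List<Integer> component = new ArrayList<>();
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(start);
            visited[start] = true;
            while (!stack.isEmpty()) {
                int v = stack.pop();
                component.add(v);
                for (int w : adjacency.get(v)) {
                    if (!visited[w]) { visited[w] = true; stack.push(w); }
                }
            }
            groups.add(component);
        }
        return groups;
    }
}
</pre>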
3.2 Experiments
Figure 3 depicts the WCF group distribution in the BTC data over the similarity threshold θJ. For one, we illustrate the number of forests that can be grouped for a given θJ (green graph, left axis). Shaded areas below the green graph depict the number of forests in groups of a given size, e.g., size = 2, size ≤ 5, size ≤ 25, etc. Also, we show the total number of groups over θJ (dashed red graph, right axis).

Figure 3: The forest group distribution for h = 4, d = 10, the ConceptID_TFIDF keyword extractor, and the TFIDFForest10 topic extractor for varying θJ. The green graph shows the number of forests that can be grouped (left axis, in thousands). The shaded areas below indicate the number of forests in groups of a specific size (= 2, ≤ 5, ≤ 25, ≤ 250, ≤ 5k, ≤ 50k, ≤ 100k, > 100k). The dashed red line shows the total number of groups (right axis, in thousands).

As for the fraction of WCFs that can be grouped, a higher θJ leads to a decrease since the higher θJ, the fewer WCF pairs have a sufficient overlap to be grouped. The shaded areas further show that the higher θJ, the more WCFs fall into small groups, since a stricter overlap criterion induces more individual groups. Vice versa, a lower θJ results in larger groups since smaller groups are merged in this case. As for the total number of groups for a specific Jaccard threshold, there is a clear increase for higher θJ. This is due to the fact that there are more smaller groups for higher θJ. The figure shows that θJ = 1.0 leads to 138k (out of 293k) forests that have at least one correspondence with an identical topic set. The majority (92k) appears in groups of size two.

For the following experiments we set θJ = 0.7 as default for three reasons: (1) This avoids very large WCF groups, which would raise the runtime of the alignment generation phase. With θJ = 0.7, the maximal group size of 4k leads to a reasonable (and feasible) runtime. (2) θJ = 0.7 provides a significant topical overlap among forests and minimizes haphazard correspondences. (3) This threshold value enables the identification of groups for more than 55% of the available WCFs (162k). Remaining WCFs do not have a sufficient topic overlap and can thus not be considered for the alignment generation.

In general, the runtime of the candidate group generation depends on the number of input WCFs, the applied self-join technique, and the topic set homogeneity (within the input set). Figure 4 compares the runtime of three different set similarity join techniques for TFIDFForest10 topics, θJ = 0.7, and a varying number of WCFs. The red graph (triangle markers) shows the quadratic runtime of the naïve baseline approach that compares all forests in a pairwise manner. The yellow line (rhomb markers) indicates the performance of the mpjoin introduced by Ribeiro and Härder [21]. In contrast to the ppjoin, the mpjoin tries to minimize the effort for the candidate set determination instead of minimizing the candidate set size. The green line (rectangle markers) illustrates the runtime of the ppjoin algorithm. The ppjoin algorithm performs best and has a nearly linear runtime. Here, the similarity join of 295k forest topics takes only 6 minutes.

Figure 4: The forest grouping runtime (in hours) for different similarity join algorithms (naive, mpjoin, ppjoin) over the forest set size (in thousands). The experiment was executed for θJ = 0.7 using the TFIDFForest10 topic extractor and h = 4 and d = 10.

Remember that these numbers have been created with a very large and heterogeneous sample from the current Web of Data – the BTC. Thus, we can demonstrate a reasonable runtime for current very large web data sets. What is more, Wang et al. introduce a MapReduce implementation of ppjoin [26], which leads to the conclusion that our approach can also be run on even larger future datasets.
4. ALIGNMENT GENERATION
Given the set of candidate groups, the next task is to regard each group and find alignments among its respective members independently. This leads to a runtime reduction due to the reduced input set size (|G| ≪ |F|) and the possible parallelization of the alignment generation task for different groups. In the following, we elaborate on our underlying matching strategy to identify relations among concepts, followed by an experimental evaluation.

4.1 WCF alignments
Given a set G of topically related WCFs, we need to identify a set of alignments A. We propose a procedure that, again, consists of three steps:
1. Extend G by adding related WCFs. These relations originate from the original ontology definitions.
2. Compare all forest pairs f1, f2 ∈ G using a forest overlap score function O. Extract all forest pairs with an overlap value exceeding a given threshold O(f1, f2) ≥ θO; add them to the match candidate set M.
3. Create an alignment graph D by iteratively adding candidate matches from M that neither semantically conflict with ontology definitions, nor with other candidate alignments. Finally, extract alignments A from the conflict-free alignment graph D.

Step 1. First, we extend each topic group G with further related WCFs in order to incorporate immutable axioms from the underlying ontologies. Specifically, related WCFs have an ontology-level relation in common with a WCF f1 ∈ G, i.e., owl:equivalentClass and owl:disjointWith relationships. We did not select other relations since we deem these two as most relevant to reveal semantic conflicts of alignments while keeping the group size low (which is a desirable property). In the following, the topic group G refers to the extended topic group.

Step 2. Next, we compare all pairs of WCFs f1, f2 ∈ G to determine the overlap of the forests. HCM uses the tree overlap similarity of BLOOMS+ to compare individual trees [14]. The tree overlap aggregates common nodes between two trees. Given two arbitrary trees t1, t2 from different WCFs, the tree overlap is defined as:

Overlap(t1, t2) = log( Σ_{n ∈ (t1 ∩ t2)} (1 + e^(d(n,t1)^(-1) − 1)) ) / log(2|t1|)

Thus, the tree overlap depends on the distance d of the common nodes n ∈ (t1 ∩ t2) to the root of t1. The higher the node depth in t1 (d(n, t1)), i.e., the more nodes are between n and the root page p of t1, the smaller the influence of the shared node. The depth of the root page is set to d(p, t1) = 1.

For instance, consider the tree t1 with the root p = "spacecraft" (article) from our previous example (see Fig. 2) and an arbitrary tree t2. Assuming two overlapping nodes x = "astronautics" and y = "aerospace engin." between t1 and t2 such that t1 ∩ t2 = {x, y}, the resulting depths of the nodes in t1 are d(x, t1) = 1 and d(y, t1) = 2. Therefore, the resulting tree overlap of both trees in t1 is:

Overlap(t1, t2) = log( (1 + e^0) + (1 + e^(−1/2)) ) / log(2 × 14) ≈ 0.385

The tree overlap is not symmetric (Overlap(t1, t2) ≠ Overlap(t2, t1)). With this asymmetry BLOOMS+ is able to identify parent-child relationships between concepts. The equivalence between concepts is derived if Overlap(t1, t2) = Overlap(t2, t1). However, we are interested in a similarity value for two WCFs instead of individual trees. Therefore, we compare all pairs of trees of different WCFs and select the best matching tree pair. Given two WCFs f1 and f2 (f1, f2 ∈ G; f1 ≠ f2), we define the forest similarity as the maximal harmonic mean of the overlaps over all tree pairs (t1 ∈ f1, t2 ∈ f2):

O(f1, f2) = max_{t1 ∈ f1, t2 ∈ f2} [ 2 · Overlap(t1, t2) · Overlap(t2, t1) / (Overlap(t1, t2) + Overlap(t2, t1)) ]

Then, we select the relevant matches M by using an overlap threshold O(f1, f2) ≥ θO. The selection of a good threshold is essential for the alignment quality, because a low θO value leads to a high number of irrelevant candidates and thus a lower precision. The following semantic deduction might not be able to eliminate erroneous candidates. In addition, the runtime complexity of the semantic verification phase grows. A high θO value instead leads to fewer candidates and thus a low recall. In Sec. 4.2 we evaluate the different thresholds and their influence on the alignment quality.
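The two formulas above can be turned into a small Java sketch. The exponent of the overlap summand is reconstructed from the garbled formula together with the worked example (it reproduces the value 0.385 for that example); it is therefore a best-effort reading of the BLOOMS+ measure [14], not a verified transcription, and all names are hypothetical.

<pre>
import java.util.List;
import java.util.Map;

public class ForestSimilarity {

    // A tree is represented as a map from node title to its depth, with d(root) = 1.
    static double treeOverlap(Map<String, Integer> t1, Map<String, Integer> t2) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> node : t1.entrySet()) {
            if (t2.containsKey(node.getKey())) {
                int d = node.getValue();                 // depth of the shared node in t1
                sum += 1.0 + Math.exp(1.0 / d - 1.0);    // summand (1 + e^(d^-1 - 1))
            }
        }
        if (sum <= 0.0 || t1.isEmpty()) return 0.0;      // no shared nodes
        return Math.log(sum) / Math.log(2.0 * t1.size());
    }

    // Forest similarity: maximal harmonic mean of Overlap(t1,t2) and Overlap(t2,t1) over all tree pairs.
    static double forestSimilarity(List<Map<String, Integer>> f1, List<Map<String, Integer>> f2) {
        double best = 0.0;
        for (Map<String, Integer> t1 : f1) {
            for (Map<String, Integer> t2 : f2) {
                double o12 = treeOverlap(t1, t2);
                double o21 = treeOverlap(t2, t1);
                if (o12 + o21 > 0.0) {
                    best = Math.max(best, 2.0 * o12 * o21 / (o12 + o21));
                }
            }
        }
        return best;
    }
}
</pre>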
Step 3. Next, HCM performs a logical reasoning to remove conflicting matches from the candidate set and thus enhance the precision. HCM draws conclusions for matches contradicting each other. This is done using the holistic knowledge of all ontologies. Given a ranked set of forest match candidates M, the semantic verification produces an alignment graph D = (G, E) with a set of WCF nodes G and a set of directed edges E ⊂ G × G representing the verified alignments. The edges are annotated with three different attributes:
• The certainty of an alignment, cert : G × G → [0, 1], specifies the confidence of a matching. The higher the value, the more reliable the match.
• The alignment type classifies every matching pair into one of five categories, type : G × G → {equi, disj, parent, child, onto}. An equi edge connects two concepts identified to be equal, whereas a disj edge marks two concepts as disjoint. A parent or child edge is a directed subset relation representing an rdfs:subClassOf relationship. An onto-labeled edge marks two concepts as originating from the same source ontology.
• The alignment origin provides information about the edge's origin, origin : G × G → {def, detect, infer}. A def edge is an axiom stated in the source ontology. A detect alignment pair is a verified concept alignment pair, i.e., O(f1, f2) ≥ θO. An infer edge is an inference drawn from other matches with the help of transitive closures.
To populate the graph D, HCM performs the following:
(a) Initialize the alignment graph D with no edges (E = ∅).
(b) Populate D with ontology constraint edges a (cert(a) = 1.0 and origin(a) = def) derived from rdfs:subClassOf, owl:equivalentClass, and owl:disjointWith triples from the underlying RDF data.
(c) Add relations a between WCFs originating from the same ontology with type(a) = onto, cert(a) = 1.0, and origin(a) = def. Here, we use the ontology id from a concept URI by removing the ConceptID (see Sec. 2, Step 1).
(d) Finally, the match candidates M (Step 2) are selected greedily: We add matches a = (f1, f2) from M with decreasing similarity O(f1, f2) to E. New edges in E comprise cert(a) = O(f1, f2), type(a) = equi, and origin(a) = detect as annotations.
When inserting an alignment – an ontology constraint or a candidate alignment (Step 2) – we perform three steps:
First, we check whether an alignment a contradicts the existing edges in D. To this end, HCM checks for four different types of contradictions: Conflicting definitions are alignments in the graph that are incompatible with a new alignment a = (f1, f2). A conflicting definition occurs if there is an edge ex = (f1, f2), ex ∈ E, such that type(ex) ≠ type(a), origin(ex) ≠ infer, or cert(ex) ≥ cert(a). Furthermore, HCM checks alignments for multiple-entity correspondences, crisscross correspondences, and disjointness-subsumption contradictions. These three types of contradictions were originally introduced for the ASMOV ontology alignment approach by Jean-Mary et al. [15].
If and only if no contradiction was found for the candidate alignment a, HCM adds the alignment candidate and its inverse a^(-1) to the graph's edge set E. The inverse alignment a^(-1) = (f2, f1) of a = (f1, f2) has an analog alignment identification (origin(a^(-1)) = origin(a)) and certainty (cert(a^(-1)) = cert(a)) and the inverse adapted type (type(a^(-1)) = type^(-1)(a)):

type^(-1)(x) = equi   if type(x) = equi
               disj   if type(x) = disj
               parent if type(x) = child
               child  if type(x) = parent
               onto   if type(x) = onto

Second, HCM infers further alignments from a (respectively a^(-1)) by drawing conclusions with the help of existing alignments e ∈ E. HCM supports the five types of inferences shown in Figure 5: Given an existing alignment e = (f1, f2), e ∈ E, and the alignment just added a = (f2, f3) (respectively its inverse a^(-1) = (f3, f2)), a closure a′ = (f1, f3) can be drawn.

Figure 5: Five types of alignment inferences that infer a closure a′ from a new alignment a and an existing edge ex in the matching graph. (a) shows an equivalence closure for type(ex) = type(a) = equi. (b) and (c) show disjointness closures for type(ex), type(a) ∈ {equi, disj}; type(ex) ≠ type(a). (d) and (e) show parentship and childship closures for type(ex), type(a) ∈ {parent, child}; type(ex) = type(a).

Third, given the verified alignment graph, HCM outputs all edges e ∈ E with origin(e) = detect as the alignment set A.
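A compact sketch of two ingredients of this verification step – the inverse alignment type and the test for a conflicting definition – is given below. The enum and class names are hypothetical, and the sketch deliberately omits the ASMOV-style contradiction patterns and the inference closures.

<pre>
public class AlignmentVerification {

    enum Type { EQUI, DISJ, PARENT, CHILD, ONTO }
    enum Origin { DEF, DETECT, INFER }

    static class Edge {
        final Type type;
        final Origin origin;
        final double cert;
        Edge(Type type, Origin origin, double cert) {
            this.type = type; this.origin = origin; this.cert = cert;
        }
    }

    // type^-1: parent and child are swapped, all other types are their own inverse.
    static Type inverse(Type t) {
        switch (t) {
            case PARENT: return Type.CHILD;
            case CHILD:  return Type.PARENT;
            default:     return t;
        }
    }

    // Conflicting definition: an existing edge between the same pair that differs in type,
    // is not merely inferred, or is at least as certain as the new candidate alignment.
    static boolean conflictingDefinition(Edge existing, Edge candidate) {
        return existing.type != candidate.type
                || existing.origin != Origin.INFER
                || existing.cert >= candidate.cert;
    }
}
</pre>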
For instance, given a WCF group G = {dbpedia:SpaceStation, umbel:SpacePlatform_Manned, dbpedia:Spacecraft}, the population of graph D works as follows: The only ontology constraint for G is the alignment a_onto = (dbpedia:Spacecraft, dbpedia:SpaceStation), with cert(a_onto) = 1, origin(a_onto) = def, and type(a_onto) = onto, in order to indicate that both concepts originate from the same ontology. Both the alignment a_onto and its inverse a_onto^(-1) are added to D. Afterwards the candidate set M = {a1, a2, a3} is processed, assuming alignment certainties of:
a1 = (dbpedia:SpaceStation, umbel:SpacePlatform_Manned), cert(a1) = 0.955,
a2 = (dbpedia:Spacecraft, dbpedia:SpaceStation), cert(a2) = 0.718, and
a3 = (dbpedia:Spacecraft, umbel:SpacePlatform_Manned), cert(a3) = 0.341.
HCM proceeds as follows: First, a1 is added to D with origin(a1) = detect and type(a1) = equi (and so is a1^(-1)). Afterwards, a2 is not added, due to the conflicting definition a_onto, which marks both concepts as originating from the same ontology. Finally, due to a multiple-entity correspondence, a3 is prevented too: umbel:SpacePlatform_Manned is already matched to a concept from the DBpedia ontology.
4.2 Experiments
In the following section, we discuss the experimental results of HCM for the BTC data as well as on the Ontology Alignment Evaluation Initiative benchmark track [6, 7].

BTC. To determine HCM's alignment precision for the BTC data, we executed the alignment generation algorithm using different threshold settings (θO). We chose seven ranges: sim_forest = 1, 1 > sim_forest ≥ 0.95, 0.95 > sim_forest ≥ 0.9, . . . , 0.75 > sim_forest ≥ 0.7. We did not consider alignments with O(f1, f2) < 0.7, which is along the lines of the argument by Jain et al. for BLOOMS+'s tree overlap measure [14]; the authors mention an optimal threshold of 0.85. After determining alignments using HCM, a random sample of 50 alignments for each similarity range was selected. Then three independent annotators manually inspected these alignments and identified the actual relation between the aligned concepts in a majority vote process. We used three different types of relations: Equivalence indicates a correct alignment of two equivalent concepts. Similar indicates a fuzzy match: the two concepts are related and the instance sets of the concepts have a significant intersection. Disjoint indicates an incorrect alignment of two concepts: the instance sets of the concepts are disjoint.

Table 2 shows our manual evaluation results as well as their confidence intervals (confidence level of 95%). Additionally, it depicts the absolute numbers of alignments found in the BTC data for varying thresholds. In the case of 1 > sim_forest ≥ 0.95, 57.5% of the alignments were labeled as equivalences and 15% as similar. The number of actual equivalences decreases for smaller forest similarities. When considering only the strict equivalence judgments as relevant alignments, the inter-annotator agreement for strict equivalences is very high, κ = 0.909 (Cohen's kappa coefficient). By treating the more fuzzy results (similar or equal alignments) as correct, the inter-annotator agreement decreases to κ = 0.817. This is due to a varying perception of similarity or relatedness, which is much more subjective than a strict equivalence definition. For instance, the concepts daml:Ammeters and umbel:Voltmeter are related due to their common purpose. On the other hand, from a taxonomic point of view, the instance sets of both concepts are probably disjoint. HCM discovered interesting matches such as:
• yago:PsychoactiveFungi and umbel:HallucinogenicMushroom
• dbpedia:BirdsOfAustralia and opencyc:Parrot
• umbel:AirConditioner and dbpedia:HeatExchangers
Note that, due to the diverse and non-trivial domains in the BTC corpus, these kinds of matches are hard to identify, even for humans. Our algorithm can deal with these cases because it exploits the broad knowledge for many domains present in Wikipedia.

Furthermore, it is interesting to see that even in the case of false positives, the algorithm reveals interesting semantic relations between concepts. For instance, HCM aligns the concepts yago:Outlaws and umbel:MotorcycleClub due to a semantic relation via outlaw motorcycle clubs similar to the rivals Hells Angels and Bandidos.

The runtimes of the alignment generation phase are visualized in Figure 6 for the different WCF groups (orange rhomb markers). Due to the pairwise comparison of the trees of all WCFs in a group G to generate candidate alignments (see Step 2), the alignment generation phase has a quadratic runtime in the group size |G|. Furthermore, the effort of the alignment verification is polynomial in the number of alignment candidates. The blue dotted line indicates the polynomial regression of the forest match execution times. The number of small groups is very high (see Figure 3), which is no problem because of their small runtimes. A remaining issue is the large effort of the matching step for large groups, which is left for future work.

Figure 6: Alignment generation execution time (in hours) for different WCF group sizes (number of forests per group). The blue dotted line indicates the quadratic regression line unifying the actual alignment generation time per group (indicated by a yellow rhomb). The experiment was executed with h = 4, d = 10, the ConceptID_TFIDF3 keyword extraction method, the TFIDFForest10 topic extractor (θJ = 0.7), and the forest threshold θO = 0.7.

Table 2: The probabilities and confidence intervals for the 50 random samples of the alignments proposed by HCM (confidence level 95%). The leftmost column shows the different similarity classes (O(f1, f2)). The 2nd column presents the probability and confidence interval for an alignment representing an equivalence. The 3rd column shows the probability and confidence interval for an alignment representing either equivalence or similarity. The rightmost column depicts the total number of identified alignments within the BTC data.
O(f1, f2)          P(eq)            P(eq ∪ sim)      alignments
O = 1.0            0.964 ±0.036     0.964 ±0.036     601
1.0 > O ≥ 0.95     0.575 ±0.143     0.725 ±0.129     133,317
0.95 > O ≥ 0.9     0.481 ±0.145     0.706 ±0.131     72,131
0.9 > O ≥ 0.85     0.071 ±0.066     0.294 ±0.131     6,921
0.85 > O ≥ 0.8     0.053 ±0.053     0.200 ±0.114     4,279
0.8 > O ≥ 0.75     0.053 ±0.053     0.126 ±0.092     3,139
0.75 > O ≥ 0.7     0.071 ±0.066     0.331 ±0.136     1,950
OAEI. In order to compare with other ontology alignment approaches (e.g., [2, 13, 15, 17]), we provide results on the benchmark track of the Ontology Alignment Evaluation Initiative Campaign [6]. The goal of the track is to evaluate strengths and weaknesses of different alignment approaches by providing a set of alignments between a reference ontology and systematically defamiliarized ontologies. All considered ontologies in the benchmark track originate from the bibliographic domain. Since our algorithm is designed to match concepts from various ontologies at once, we generate alignments for concepts from the ontologies in the benchmark track. To determine the matching quality we extract all concept pairs between ontologies having a reference alignment. HCM treats concepts with equivalent URIs as one entity; therefore we implicitly infer alignments between concepts with equivalent URIs. Furthermore, the reference alignments for properties are neglected, because HCM does not target property alignments. Using a WCF overlap threshold of θO = 0.7 leads to a precision of 85% and a recall of 55%. In comparison to other approaches that additionally offered alignments between properties, these numbers are average. This is mainly due to the changes of ConceptIDs and descriptions in the automatically generated ontologies. For instance, the concept xsqlknk of ontology 266 neither contains a human-readable ConceptID nor a description. Nevertheless, the reference alignments contain a match with 101:MastersThesis. Many state-of-the-art ontology matching algorithms solve this problem by using a structure-based similarity as the sole basis for decision making. However, for the LOD ontology matching task, this approach is unpromising, because the different scopes of the authors of LOD data sets lead to large differences in the ontology structures. For instance, the two data sets DBpedia (http://dbpedia.org/) and YAGO (http://www.mpi-inf.mpg.de/yago-naga/yago/) represent knowledge extracted from Wikipedia. Nevertheless, the ontologies of both data sets vary significantly in structure, whereas many common concepts are shared.

In 2011 a new bibliographic benchmark data set was generated [7]; the results of HCM are equivalent for the new data set.

5. CONCLUSION AND FUTURE WORK
In this paper, we tackle the problem of large-scale concept matching – one of the major challenges in ontology matching [19]. In contrast to other approaches that process pairs of ontologies, our focus is on aligning many ontologies from the LOD cloud simultaneously in a holistic manner. To this end, we propose an abstract workflow (knowledge extraction, grouping, and alignment generation) to specifically enable scalability while still examining all ontologies in their entirety. Further, we plug state-of-the-art techniques and novel ideas into this workflow and report on promising results for scalability and alignment quality in a web-scale alignment scenario. For representing knowledge, we chose Wikipedia Category Forests. For grouping the input, we leverage topical information. Last, the alignment generation leverages Wikipedia Category Forest overlaps and performs a semantic inspection.

We have many ideas for future directions: For instance, we will look into other knowledge representations and matching techniques that can be plugged into the workflow. The challenge here is to identify approaches that can capture different semantic notions of an input concept and allow a useful grouping. Also, we plan to derive further relationship types – maybe with the help of instance data instead of Wikipedia categories. Last, we plan to experiment with other topical groupings that, e.g., allow overlaps across groups. Another grouping method could also support incremental updates and thus facilitate online ontology alignments for the growing web of linked data.

References
[1] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data – The Story So Far. International Journal on Semantic Web and Information Systems, 4(2):1–22, 2009.
[2] I. F. Cruz, F. P. Antonelli, and C. Stroe. AgreementMaker: Efficient matching for large real-world schemas and ontologies. In Proceedings of the VLDB Endowment, pages 1586–1589, 2009.
[3] J. David, F. Guillet, and H. Briand. Matching directories and OWL ontologies with AROMA. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 830–831, 2006.
[4] H. H. Do and E. Rahm. COMA – a system for flexible combination of schema matching approaches. In VLDB, pages 610–621, 2002.
[5] K. Eckert, C. Meilicke, and H. Stuckenschmidt. Improving ontology matching using meta-level learning. In ESWC, pages 158–172, 2009.
[6] J. Euzenat, A. Ferrara, C. Meilicke, J. Pane, F. Scharffe, P. Shvaiko, H. Stuckenschmidt, O. Šváb-Zamazal, V. Svátek, and C. Trojahn. Results of the Ontology Alignment Evaluation Initiative 2010. In Proceedings of the International Workshop on Ontology Matching at ISWC, 2010.
[7] J. Euzenat, A. Ferrara, W. R. van Hage, L. Hollink, C. Meilicke, A. Nikolov, F. Scharffe, P. Shvaiko, H. Stuckenschmidt, O. Šváb-Zamazal, and C. Trojahn. Final results of the Ontology Alignment Evaluation Initiative 2011. In Proceedings of the International Workshop on Ontology Matching at ISWC, 2011.
[8] J. Euzenat and P. Shvaiko. Ontology Matching. Springer-Verlag, 2007.
[9] B. He and K. C.-C. Chang. Statistical Schema Matching across Web Query Interfaces. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), pages 217–228, 2003.
[10] B. He, K. C.-C. Chang, and J. Han. Discovering complex matchings across web query interfaces: a correlation mining approach. In Proceedings of the Conference on Knowledge Discovery and Data Mining, pages 148–157, 2004.
[11] J. Hopcroft and R. Tarjan. Algorithm 447: efficient algorithms for graph manipulation. Communications of the ACM, 16:372–378, 1973.
[12] J. Huber, T. Sztyler, J. Nößner, and C. Meilicke. CODI: Combinatorial optimization for data integration – results for OAEI 2011. In Proceedings of the International Workshop on Ontology Matching (OM), 2011.
[13] P. Jain, P. Hitzler, A. P. Sheth, K. Verma, and P. Z. Yeh. Ontology Alignment for Linked Open Data. In Proceedings of the International Semantic Web Conference (ISWC), 2010.
[14] P. Jain, P. Z. Yeh, K. Verma, R. G. Vasquez, M. Damova, P. Hitzler, and A. P. Sheth. Contextual Ontology Alignment of LOD with an Upper Ontology: A Case Study with Proton. In Proceedings of the Extended Semantic Web Conference (ESWC), 2010.
[15] Y. R. Jean-Mary, E. P. Shironoshita, and M. R. Kabuka. Ontology matching with semantic verification. Web Semantics: Science, Services and Agents on the World Wide Web, 7:235–251, 2009.
[16] Y. Lee, M. Sayyadian, A. Doan, and A. S. Rosenthal. eTuner: tuning schema matching software using synthetic scenarios. The VLDB Journal, 16:97–122, January 2007.
[17] J. Li, J. Tang, Y. Li, and Q. Luo. RiMOM: A dynamic multistrategy ontology alignment framework. IEEE Transactions on Knowledge and Data Engineering (TKDE), 21:1218–1232, 2008.
[18] A. Nikolov, V. Uren, E. Motta, and A. de Roeck. Overcoming Schema Heterogeneity between Linked Semantic Repositories to Improve Coreference Resolution. In Proceedings of the Asian Conference on The Semantic Web, pages 332–346. Springer Berlin / Heidelberg, 2009.
[19] P. Shvaiko and J. Euzenat. Ontology Matching: State of the Art and Future Challenges. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2012, to appear.
[20] E. Rahm. Towards Large-Scale Schema and Ontology Matching. In Schema Matching and Mapping, chapter 1, pages 3–27. Springer Berlin / Heidelberg, 2011.
[21] L. A. Ribeiro and T. Härder. Efficient Set Similarity Joins Using Min-prefixes. In Advances in Databases and Information Systems, pages 88–102. Springer Berlin / Heidelberg, 2009.
[22] D. Ritze and H. Paulheim. Towards an Automatic Parameterization of Ontology Matching Tools based on Example Mappings. In P. Shvaiko, J. Euzenat, T. Heath, C. Quix, M. Mao, and I. Cruz, editors, Proceedings of the 6th International Workshop on Ontology Matching, volume 814 of CEUR Workshop Proceedings, pages 37–48, http://ceur-ws.org, October 2011.
[23] K. Saleem, Z. Bellahsene, and E. Hunt. PORSCHE: Performance ORiented SCHEma mediation. Information Systems, 33:637–657, 2008.
[24] W. Su, J. Wang, and F. Lochovsky. Holistic Schema Matching for Web Query Interfaces. In Proceedings of the International Conference on Extending Database Technology (EDBT), pages 77–94, 2006.
[25] F. M. Suchanek, S. Abiteboul, and P. Senellart. PARIS: probabilistic alignment of relations, instances, and schema. Proceedings of the VLDB Endowment, 5:157–168, 2011.
[26] C. Wang, J. Wang, X. Lin, W. Wang, H. Wang, H. Li, W. Tian, J. Xu, and R. Li. MapDupReducer: detecting near duplicates over massive datasets. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), pages 1119–1122, 2010.
[27] C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In Proceedings of the International World Wide Web Conference (WWW), pages 131–140, 2008.