=Paper=
{{Paper
|id=Vol-2788/oaei20_paper4
|storemode=property
|title=AROA results for OAEI 2020
|pdfUrl=https://ceur-ws.org/Vol-2788/oaei20_paper4.pdf
|volume=Vol-2788
|authors=Lu Zhou,Pascal Hitzler
|dblpUrl=https://dblp.org/rec/conf/semweb/ZhouH20
}}
==AROA results for OAEI 2020==
AROA Results for OAEI 2020∗ Lu Zhou and Pascal Hitzler DaSe Lab, Kansas State University, Manhattan KS 66506, USA {luzhou, hitzler}@ksu.edu Abstract. This paper introduces the results of an ontology alignment system named Association Rule-based Ontology Alignment (AROA) in the Ontology Alignment Evaluation Initiative (OAEI) 2020 campaign. This ontology alignment system focuses on producing simple and com- plex alignment between ontologies that are populated with instance data. This is the second participation of AROA in the OAEI campaign, and it produces the best performance in terms of relaxed F-measure on two benchmarks in complex track, which are populated GeoLink and popu- lated Enslaved. 1 Presentation of the system 1.1 State, purpose, general statement AROA (Association Rule-based Ontology Alignment) system aims to automati- cally generate simple and complex alignment between two and more ontologies. These ontologies are required to have shared common instance data because AROA relies on association rule mining and requires these instances as input to discover interesting relations. After generating a set of association rules, AROA utilizes the simple and complex correspondence patterns that have been widely accepted in the Ontology Matching community [4, 5] to further narrow a large number of rules down to more meaningful ones and finally establishes the align- ments. 1.2 Specific techniques used Figure 1 illustrates the overview of AROA alignment system. In this section, we introduce each step of AROA alignment system along with some concepts that we frequently use in the AROA system, such as association rule mining, FP-growth algorithm, and complex alignment generation. Clean Triple. First, AROA extracts all triples as the format of hSubject, Predicate, Objecti from the source and target ontologies. Each item in a triple is expressed as a web URI. After collecting all of the triples, we clean the data based on the following criteria: we only keep the triples that contain at least one entity under the source or the target ontology namespace or the triples contain rdf:type information, as our algorithm relies on this information. ∗ Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2 Lu Zhou and Pascal Hitzler Fig. 1. Overview of AROA Alignment System Generate Transaction Database. After the filtering process, we generate the transaction database as the input for the FP-growth algorithm. Let I = {i1 , i2 , . . . , in } be a set of distinct attributes called items. Let D = {t1 , t2 , . . . , tm } be a set of transactions where each transaction in D has a unique transaction ID and contains a subset of the items in I. Table 1 shows a list of transactions corresponding to a list of triples. Instance data can be displayed as a set of triples, each consisting of subject, predicate, and object. Here, subjects represent the identifiers and the set of corresponding properties with the objects represent transactions, which are separated by the symbol “|”. I.e., a transaction is a set T = (s, Z) such that s is a subject, and each member of Z is a pair (p, o) of a property and an object such that (s, p, o) is an instance triple. Generate Typed Transaction Database. Then we replace the object in the triples with its rdf:type1 because we focus on generating schema-level (rather than instance-level) mapping rules between two ontologies, and the type 1 If there are multiple types of the object, it can also combine the subject and predicate as additional information to determine the correct type, or keep both types as two triples. Table 1. Triples and Corresponding Transactions s1 p1 o1 s1 p2 o2 s1 p4 o4 TID Itemsets s2 p1 o1 s1 p1 |o1 , p2 |o2 , p4 |o4 s2 p2 o2 s2 p1 |o1 , p2 |o2 , p3 |o3 , p4 |o4 s2 p3 o3 s3 p1 |o1 , p2 |o2 s2 p4 o4 s3 p1 o1 s3 p2 o2 AROA Results for OAEI 2020† 3 Table 2. Original Transaction Database TID Itemsets x1 gbo:hasAward|y1 , gmo:fundedBy|y2 x2 gbo:hasFullName|y3 , gmo:hasPersonName|y4 x3 rdf:type|gbo:Cruise, rdf:type|gmo:Cruise Table 3. Typed Transaction Database TID Itemsets x1 gbo:hasAward|gbo:Award, gmo:fundedBy|gmo:FundingAward x2 gbo:hasFullName|xsd:string, gmo:hasPersonName|gmo:PersonName x3 rdf:type|gbo:Cruise, rdf:type|gmo:Cruise information of the object is more meaningful than the original URI. If an object in a triple has rdf:type of a class in ontology, we replace the URI of the object with its class. If the object is a data value, the URI of the object is replaced with the datatype. If the object already is a class in ontology, it remains unchanged. Tables 2 and 3 show some examples of the conversion. Generate Association Rules. Our alignment system mainly depends on a data mining algorithm called association rule mining, which is a rule-based machine learning method for discovering interesting relations between variables in large databases [3]. Many algorithms for generating association rules have been proposed, like Apriori [1] and FP-growth algorithm [2]. In this paper, we use FP-growth to generate association rules between ontologies, since the FP-growth algorithm has been proven superior to other algorithms [2]. The FP- growth algorithm is run on the transaction database in order to determine which combinations of items co-occur frequently. The algorithm first counts the number of occurrences of all individual items in the database. Next, it builds an FP-tree structure by inserting these instances. Items in each instance are sorted by de- scending order of their frequency in the dataset so that the tree can be processed quickly. Items in each instance that do not meet the predefined thresholds, such as minimum support and minimum confidence (see below for these terms), are discarded. Once all large itemsets have been found, the association rule creation begins. Every association rule is composed of two sides. The left-hand-side is called the antecedent, and the right-hand-side is the consequent. These rules indicate that whenever the antecedent is present, the consequent is likely to be Table 4. Examples of Association Rules Antecedent Consequent p4 |o4 , p1 |o1 p2 |o2 p2 |o2 p1 |o1 p4 |o4 p1 |o1 4 Lu Zhou and Pascal Hitzler Table 5. The Alignment Pattern Types Covered in AROA System Pattern Category Class Equivalence 1:1 Class Subsumption 1:1 Property Equivalence 1:1 Property Subsumption 1:1 Class by Attribute Type 1:n Class by Attribute Value 1:n Property Typecasting Equivalence 1:n Property Typecasting Subsumption 1:n Typed Property Chain Equivalence m:n Typed Property Chain Subsumption m:n as well. Table 4 shows some examples of association rules generated from the transaction database in Table 1. Generate Alignment. AROA utilizes some simple and complex correspon- dences that have been widely accepted in Ontology Matching community to further filter rules [4, 5] and finally generate the alignments. There are a total of 10 different types of correspondences that AROA covers this year. Table 5 lists all the simple and complex alignment correspondences and corresponding categories. Since the association rule mining might generate a large number of rules, in order to narrow the association rules down to a smaller set, AROA follows these patterns to generate corresponding alignments. For example, Class by Attribute Type (CAT) is a classic complex alignment pattern. This type of pattern was first introduced in [4]. It states that a class in the source ontology is in some relationship to a complex construction in the target ontology. This complex construction may comprise an object property and its range. Class C1 is from ontology O1 , and object property op1 and its range t1 are from ontology O2 . Association Rule format: rdf:type|C1 → op1 |t1 Example: rdf:type|gbo:PortCall → gmo:atPort|gmo:Place Generated Alignment: gbo:PortCall(x) → gmo:atPort(x, y) ∧ gmo:Place(y) In this example, this association rule implies that if the subject x is an indi- vidual of class gbo:PortCall, then x is subsumed by the domain of gmo:atPort with its range gmo:Place. The equivalence relationship can be generated by combin- ing another association rule holding the reverse information. Other simple and complex alignments are also generated by following the same steps. 1.3 Adaptations made for the evaluation AROA is an instance-based ontology alignment system. Therefore, AROA em- beds Apache Jena Fuseki server in the system. The ontologies are first down- loaded from the SEALS repository. And then, AROA uploads and stores the AROA Results for OAEI 2020† 5 Table 6. The Number of Alignments Found on Populated GeoLink Benchmark Alignment Patterns Category Reference Alignment AROA - - - # of Correct Entities # of Correct Relation Class Equiv. 1:1 10 10 10 Class Subsum. 1:1 2 1 0 Property Equiv. 1:1 7 5 5 Property Typecasting Subsum. 1:n 5 3 0 Property Chain Equiv. m:n 26 15 13 Property Chain Subsum. m:n 17 7 0 ontologies in the embedded Fuseki server, which might take some time for this step to load large-size ontology pairs. 2 Results This year, AROA alignment system evaluates its performance on the populated GeoLink benchmark [5, 6] and populated Enslaved benchmark [7]. In the pop- ulated GeoLink benchmark, there are 19 simple mappings, including 10 class equivalence, 2 class subsumption, and 7 property equivalence. And there are 48 complex mappings, including 5 property subsumption, 26 property chain equiv- alence, and 17 property chain subsumption. In the populated Enslaved bench- mark, 15 simple mappings are all class equivalences. And there are 83 complex mappings, including 68 property chain equivalence and 15 property chain sub- sumption. Table 6 and Table 7 list the alignment patterns and categories in the populated GeoLink and populated Enslaved Benchmark with the results of AROA system. We list the numbers of identified mappings for each pattern. There are two dimensions that we can look into the details to understand the performance. The first dimension is the entity identification, which means, given an entity in the source ontology, the system should be able to generate related entities in the target ontology. Another dimension is relationship identification, in which the system should detect the correct relationship between these en- tities, such as equivalence and subsumption. Therefore, we list the number of correct entities and the number of correct relationships in order to understand the strengths and weaknesses of the system. For example, In the Table 6, AROA correctly identifies all 1:1 class equivalence including entity and relationship. AROA also finds one class subsumption alignment, which is the class P ortCall in the GeoLink Base Ontology (GBO) is related to the class F ix in the Ge- oLink Modular Ontology (GMO). However, it outputs the relationship between Table 7. The Number of Alignments Found on Populated Enslaved Benchmark Alignment Patterns Category Reference Alignment AROA - - - # of Correct Entities # of Correct Relation Class Equiv. 1:1 15 11 11 Property Chain Equiv. m:n 68 29 29 Property Chain Subsum. m:n 15 3 0 6 Lu Zhou and Pascal Hitzler Table 8. The Performance Comparison on Populated GeoLink and Populated Enslaved Benchmarks Matcher Populated GeoLink Populated Enslaved (1:1) (1:n) m:n Relaxed Precision Relaxed F-measure Relaxed Recall (1:1) (1:n) m:n Relaxed Precision Relaxed F-measure Relaxed Recall Reference Alignment 19 5 43 - - - 15 0 83 - - - AMLC 13 0 0 0.50 0.32 0.23 12 0 18 0.73 0.40 0.28 AROA 15 3 22 0.87 0.60 0.46 11 0 32 0.80 0.51 0.38 CANARD 15 2 17 0.89 0.54 0.39 3 0 16 0.42 0.19 0.13 P ortCall and F ix as equivalence, which it should be subsumption. Therefore, we count the number of correct entities as 1 and the number of correct relations as 0. This criterion is also applied to other patterns. In the Table 7, AROA de- tects 73% (11 out of 15) of the simple class equivalences and 38% (32 out of 83) of the complex mappings in the populated Enslaved benchmark. In addition, we compare the performance of AROA against other complex alignment systems in Table 8. AMLC, AROA, and CANARD are only three systems can produce complex relations on the complex benchmarks. AROA found the highest number of complex alignments and achieved the best performance in terms of relaxed recall and relaxed f-measure on both benchmarks.2 3 General comments From the performance comparison, AMLC, AROA, and CANARD can generate almost correct complex alignment, which means some alignments found by these two systems may not be completely correct, but it can be easily improved by semi-automated fashion. For example, the system can produce correct entities that should be involved in a complex alignment, but it doesn’t output the cor- rect relationship. Another possible situation is that the system can detect the correct relationship but fails to find all the entities. Based on these situations, we will investigate the incorrect alignments and improve the algorithm to find the relationship and entities as accurately as possible. 4 Conclusions This paper introduces the AROA ontology alignment system and its preliminary results in the OAEI 2020 campaign. This year, AROA evaluates its performance on populated GeoLink and populated Enslaved benchmarks and achieves the best performance in terms of relaxed recall and relaxed f-measure among the three complex alignment systems. We will continue to evaluate AROA on other benchmarks and improve the algorithm in the near future. 5 Acknowledgement This work has been supported by the National Science Foundation under Grant No. 2033521, KnowWhereGraph: Enriching and Linking Cross-Domain Knowl- 2 http://oaei.ontologymatching.org/2020/results/complex/geolink/index.html AROA Results for OAEI 2020† 7 edge Graphs using Spatially-Explicit AI Technologies and the Andrew W. Mel- lon Foundation through the Enslaved project (identifiers 1708-04732 and 1902- 06575). References 1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Bocca, J.B., Jarke, M., Zaniolo, C. (eds.) VLDB’94, Proc. of 20th International Conference on Very Large Data Bases, September 12- 15, 1994, Santiago de Chile, Chile. pp. 487–499. Morgan Kaufmann (1994), http://www.vldb.org/conf/1994/P487.PDF 2. Han, J., et al.: Mining frequent patterns without candidate gener- ation: A frequent-pattern tree approach. Data Min. Knowl. Discov. 8(1), 53–87 (2004). https://doi.org/10.1023/B:DAMI.0000005258.31418.83, https://doi.org/10.1023/B:DAMI.0000005258.31418.83 3. Piatetsky-Shapiro, G.: Discovery, analysis, and presentation of strong rules. In: Knowledge Discovery in Databases, pp. 229–248. AAAI/MIT Press (1991) 4. Ritze, D., Meilicke, C., Sváb-Zamazal, O., Stuckenschmidt, H.: A pattern-based on- tology matching approach for detecting complex correspondences. In: Shvaiko, P., Euzenat, J., Giunchiglia, F., Stuckenschmidt, H., Noy, N.F., Rosenthal, A. (eds.) Proceedings of the 4th International Workshop on Ontology Matching (OM-2009) collocated with the 8th International Semantic Web Conference (ISWC-2009) Chan- tilly, USA, October 25, 2009. CEUR Workshop Proceedings, vol. 551. CEUR-WS.org (2009), http://ceur-ws.org/Vol-551/om2009 Tpaper3.pdf 5. Zhou, L., et al.: A complex alignment benchmark: Geolink dataset. In: Vrandecic, D., Bontcheva, K., Suárez-Figueroa, M.C., Presutti, V., Celino, I., Sabou, M., Kaffee, L., Simperl, E. (eds.) The Semantic Web - ISWC 2018 - 17th International Seman- tic Web Conference, Monterey, CA, USA, October 8-12, 2018, Proceedings, Part II. Lecture Notes in Computer Science, vol. 11137, pp. 273–288. Springer (2018). https://doi.org/10.1007/978-3-030-00668-6 17, https://doi.org/10.1007/978-3-030- 00668-6 17 6. Zhou, L., Cheatham, M., Krisnadhi, A., Hitzler, P.: Geolink data set: A complex alignment benchmark from real-world ontology. Data Intell. 2(3), 353–378 (2020). https://doi.org/10.1162/dint a 00054, https://doi.org/10.1162/dint a 00054 7. Zhou, L., Shimizu, C., Hitzler, P., Sheill, A.M., Estrecha, S.G., Foley, C., Tarr, D., Rehberger, D.: The enslaved dataset: A real-world complex ontology align- ment benchmark using wikibase. In: d’Aquin, M., Dietze, S., Hauff, C., Curry, E., Cudré-Mauroux, P. (eds.) CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19- 23, 2020. pp. 3197–3204. ACM (2020). https://doi.org/10.1145/3340531.3412768, https://doi.org/10.1145/3340531.3412768