Semi-Automatically Generated Hybrid Ontologies for Information Integration

Lisa Ehrlinger and Wolfram Wöß
Institute for Application Oriented Knowledge Processing
Johannes Kepler University Linz, Austria
{lisa.ehrlinger | wolfram.woess}@jku.at

ABSTRACT
Large and medium-sized enterprises and organizations are in many cases characterized by a heterogeneous and distributed information system infrastructure. For data processing activities as well as data analytics and mining, it is essential to establish a correct, complete and efficient consolidation of information. Information integration and aggregation are therefore fundamental steps in many analytical workflows. Furthermore, in order to evaluate and classify the result of an integrated data query, and thus the quality of the resulting data analytics, the data quality of each processed data source has to be determined first.

This paper aims mainly at the first aspect of this twofold challenge. Both the data dictionary and the information source content are analyzed to derive the conceptual schema, which is then provided as a machine-readable description of the information source semantics. Several such descriptions can be integrated into a global view by eliminating possible redundancies and by applying ontology similarity measures. Attributes for data quality metrics are included in the descriptions but not yet determined. The implementation of the presented approach is evaluated by extracting the semantics of a specific MySQL database, represented as RDF triples.

Keywords
Data Science Workflows, Data Analytics, Data Quality, Information Integration, Semantics.

1. INTRODUCTION
Fact-based strategic decisions of organizations and enterprises are frequently supported by analyzing and interpreting data that is stored in distributed and heterogeneous information sources. To ensure the quality of such decisions, which depends directly on the quality of an integrated result set, the quality of each participating data source has to be determined beforehand. Consequently, comparability of heterogeneous data sources has to be enabled.

We introduce a twofold approach for improving information integration, with benefits for data science workflows, data analytics and data quality assessment. Firstly, descriptions of the semantics of heterogeneous information systems are extracted from their metadata, and information source ontologies are generated semi-automatically. This step is fundamental to establish comparability between the single information sources for further processing, as it produces a homogeneous view on heterogeneous data structures. Secondly, the resulting source ontologies are harmonized by applying ontology similarity measures and automatically integrated into a domain ontology for the entire integration infrastructure. This task enables comprehensive quality assessment across all information sources and provides additional information about existing redundancies.

The homogeneous view is achieved by generating a machine-readable ontology for each individual information source. This process is based on a common vocabulary in combination with a mapping of concepts from different data sources such as relational databases, XML, spreadsheets or NoSQL databases, which are presented in Section 2. Furthermore, the generation of an initial ontology and the integration process of several other ontologies are described. For a machine-readable representation of the ontologies, the Resource Description Framework (RDF), RDF Schema and the Web Ontology Language (OWL) are used.

An integration scenario is performed by a Java implementation for semi-automatically generating hybrid ontologies from MySQL databases. For this proof of concept and the evaluation presented in Section 3, the focus is on relational databases, since they are still the most widely used data source. Section 4 discusses related work, and Section 5 covers the conclusion and open issues as well as further research activities in this field.

2. SEMI-AUTOMATICALLY GENERATED HYBRID ONTOLOGIES
This section introduces a unified vocabulary for the representation of different types of information sources and the mapping of their concepts to the vocabulary terms. Afterwards, it is explained how the resulting data source ontologies are stepwise integrated into a domain ontology for the integration infrastructure.

Table 1: Specification of information sources by dsd

dsd Vocabulary          Relational database     XML Schema              Spreadsheets           Cassandra
Datasource              database                document/namespace      file                   keyspace
Concept                 relation (table)        simple-, complexType    sheet                  table/column family
Attribute               attribute               attribute, element      column/row header      attribute/column
ReferenceAssociation    relation (table)        -                       -                      -
InheritanceAssociation  relation (table)        extension, restriction  -                      -
AggregationAssociation  relation (table)        complexType             -                      table/column family
PrimaryKey              primary key             key                     -                      -
ForeignKey              foreign key             keyref                  -                      -
XSD data type           RDB specific data type  XSD data type           spreadsheet data type  CQL data type

2.1 Vocabulary for Data Source Descriptions
The presented approach uses the domain-specific vocabulary "description of a data source" (dsd, http://ehrlinger.cc/voc/dsd) for the machine-readable representation, which is based on OWL, RDF and RDF Schema. Already defined properties are reused by integrating the World Wide Web Consortium (W3C) recommendations dcterms (http://purl.org/dc/terms), void (http://rdfs.org/ns/void) and foaf (http://xmlns.com/foaf/spec).

A data source consists of arbitrarily many instances of the class Concept, which represents real-world objects.
Concepts can be related to each other with instances of the class Association, which can be further divided into three subclasses. A ReferenceAssociation describes a regular relationship, for example the employment of a person at a company. An inheritance relationship is modeled with InheritanceAssociation and an aggregation with AggregationAssociation. Properties of concepts and associations are described by the class Attribute, which is defined by an xsd:Datatype. In order to enable modeling of referential integrity in relational databases (RDB), the classes PrimaryKey and ForeignKey are implemented, which are assigned to Concept and are composed of several Attributes. The class Stakeholder inherits from foaf:Agent and is used to describe departments or persons and their access privileges to data sources or parts of them. Figure 1 depicts the taxonomy of dsd.

[Figure 1: Classes of the dsd vocabulary - Datasource (a subclass of void:Dataset), DatasourceType, Concept, Attribute, the Association subclasses ReferenceAssociation, InheritanceAssociation and AggregationAssociation, PrimaryKey, ForeignKey, and Stakeholder (a subclass of foaf:Agent), all instances of owl:Class]

Relationships between dsd classes are described by a set of OWL object properties and data type properties with the prefix dsd. A Datasource may contain several concepts that are linked by the hasComponent property. Concepts consist of attributes, connected by the hasAttribute property, and include key attributes expressed by hasPrimaryKey and hasForeignKey, which in turn consist of attributes and refer to other keys.

Inheritance and aggregation associations are composed of parents and children (hasParent, hasChild) or aggregations and their components (hasAggregation, hasAggregationComponent). They are modeled as classes instead of relations in order to describe their completeness and disjointness with the Boolean properties isComplete and isDisjoint. An association is disjoint if all child or component concepts are disjoint. Completeness is given if all individuals of the child/component concepts are also represented in the parent/aggregation concepts.

2.2 Description of an Information Source
For describing the semantics of a data source, classes of the dsd vocabulary are instantiated with concepts of the original data model. By using abstract and concrete properties, metadata is applied to those instances. Since the implementation focuses on relational databases, this type of mapping is described in detail. The other data sources depicted in Table 1 were selected according to their relevance during the conceptualization phase of this research project.

A relational database table cannot be mapped uniquely to a class of dsd as a result of semantic loss. Consequently, it is not automatically decidable whether a table in an RDB represents a real-world object (e.g. employee), a reference association (e.g. employee works in company), an inheritance association (e.g. manager inherits from employee, modeled in two separate tables) or an aggregation (e.g. a table containing a company and its specific departments). Although a lot of research has already been carried out on this issue, no satisfying solution has been proposed so far. Section 3.1 describes the implemented reverse engineering approach with a reliable automatic detection of concepts and reference associations, whereas the cases InheritanceAssociation and AggregationAssociation require manual rework.

The eXtensible Markup Language (XML) is essential for data interchange purposes. Table 1 includes a mapping for the XML Schema Definition (XSD). Documents that do not apply an explicit schema can be analyzed directly, although the quality of the representation might be low.

Spreadsheets as well as comma-separated values files are frequently used for basic import/export functions to exchange structured information between heterogeneous information systems or data sources. At the current development level, spreadsheets are treated as a single relational table, which enables a unique mapping to the dsd vocabulary.

Finally, NoSQL databases are taken into account by exemplarily using Cassandra, which is optimized for storing and processing large amounts of data with high performance requirements. Due to Cassandra's architecture, it can be assumed that reference or inheritance associations do not exist, although denormalized tables probably represent an AggregationAssociation.
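To illustrate how the vocabulary is instantiated, the following Turtle fragment sketches a hypothetical dsd description of a single relational table employee. The ex: namespace, the instance names and the exact form of the dsd prefix URI (here assumed to end in #) are invented for this sketch; the classes and properties are the ones introduced above.

```turtle
@prefix dsd:     <http://ehrlinger.cc/voc/dsd#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/hr#> .   # hypothetical namespace

# the database itself
ex:hrdb a dsd:Datasource ;
    dcterms:title "hr" ;
    dsd:hasComponent ex:employee .

# one table, mapped to a Concept (cf. Table 1)
ex:employee a dsd:Concept ;
    dcterms:title "employee" ;
    dsd:hasAttribute ex:employee_id, ex:employee_name ;
    dsd:hasPrimaryKey ex:employee_pk .

# attributes with their XSD data types
ex:employee_id a dsd:Attribute ;
    dcterms:title "id" ;
    dsd:isOfDataType xsd:integer ;
    dsd:isNullable false .

ex:employee_name a dsd:Attribute ;
    dcterms:title "name" ;
    dsd:isOfDataType xsd:string .   # e.g. mapped from VARCHAR

# referential integrity
ex:employee_pk a dsd:PrimaryKey ;
    dsd:hasAttribute ex:employee_id .
```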
Denormalized database tables concern not only NoSQL databases and spreadsheets, but also relational databases, and they entail the risk of redundancy, which is a matter of data source quality and currently the main focus of a follow-up research project.

Each information source defines proprietary data types for attributes. In order to achieve comparability between instances of the class Attribute, it is necessary to initially map data source specific data types to common XSD data types. For example, Oracle VARCHAR is mapped to xs:string.

2.3 Generation of the Domain Ontology
In order to integrate several heterogeneous information sources and to enable comprehensive data analytics, harmonization and consolidation of the participating source ontologies is necessary. To optimize this process, a global reference ontology created by a domain expert would be desirable, but cannot be assumed to exist in practice. This leads to the demand for an automatic integration.

The proposed process follows a Global-as-View (GaV) approach. According to Lenzerini's tutorial on data integration [4], the global schema is built up on views over the sources when modeling an integration system with GaV. This mapping perfectly supports querying, which is a basic requirement for analytics. The disadvantage of GaV is its inflexibility, because extensions might lead to complex refinements of the global schema. This drawback is acceptable here, since the automatic recreation of the global ontology requires little effort.

The integration starts with any one of the data source ontologies, which is initially compared with any other data source ontology, thus building up the integrated ontology by considering similarities and differences. In the following iterations, each remaining ontology is compared with the integrated domain ontology. For each resource r_ds of a sequentially added data source description, the similarity to each existing resource r_i in the domain ontology is calculated using a suitable similarity measure (described in detail in Section 3). Two resources are considered equal if the calculated value exceeds a pre-defined threshold. In this case a new relationship r_i owl:sameAs r_ds is added to the description of the integration system. If the similarity value of a concept is greater than the threshold but less than 1.0, thus indicating minor differences, the Concept of the data source ontology is investigated for Attributes (again using similarity measures) that do not yet exist in the domain model. Those attributes are added to the domain ontology, whereas attributes already existing in the domain ontology but not in the data source ontology remain in the domain ontology and are not removed. If no corresponding resource is identified for r_ds, the name and properties of this resource are copied to the description of the integration system, and again a link to the original resource is added with owl:sameAs.

3. IMPLEMENTATION AND EVALUATION
This section presents the Java implementation of the hybrid ontology architecture. The evaluation of the proposed approach focuses on relational databases, since they are still the most widely used data source. The implementation firstly generates data source ontologies representing the semantics of MySQL databases and secondly consolidates them into an integrated domain ontology. The implementation is evaluated by performing three test scenarios.

3.1 Ontology Generation
The characterization process of a single relational database is based on reverse engineering, in order to reconstruct semantic information that was lost through forward engineering. In the implementation this task is subdivided into the following six steps:

1. Metadata about tables, attributes, primary and foreign keys, which are stored in the data dictionary, are extracted and stored in Java classes that represent the corresponding concepts.

2. Tables are automatically classified according to the reverse engineering approach by Lubyte and Tessaris [5]. This approach identifies base relations (mapped to Concept) and relationship relations (mapped to ReferenceAssociation), but in some cases no classification can be found due to the constellation of foreign and primary keys.

3. A user is requested to approve the classified tables and to assign the tables that could not be classified automatically. This step should be performed by a person with knowledge about the local data source and the dsd vocabulary.

4. By using the Apache Jena Framework (https://jena.apache.org), an OWL ontology based on the tables and the information obtained from reverse engineering is generated.

5. Information about the completeness of associations is determined by analyzing the tuples of the corresponding association tables. If the aggregation or parent class stores a related tuple (resulting from a primary-key/foreign-key relationship) for every tuple stored in a component or child class, the property isComplete is set to true, otherwise to false.

6. The final description of the data source is stored in Turtle (http://www.w3.org/TR/turtle) notation.
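The automatic classification in step 2 can be sketched roughly as follows. This is a strong simplification of the approach by Lubyte and Tessaris [5], not the paper's actual implementation: a table whose columns are entirely covered by at least two foreign keys is treated as a relationship relation (ReferenceAssociation), a table with attributes of its own as a base relation (Concept), and the remaining constellations are left unclassified for manual assignment in step 3. All class and method names are invented for this sketch.

```java
import java.util.List;
import java.util.Set;

/** Simplified table classifier; a rough sketch, not the actual implementation. */
public class TableClassifier {

    public enum DsdClass { CONCEPT, REFERENCE_ASSOCIATION, NOT_DEFINED }

    /** Minimal view of the metadata extracted in step 1 (hypothetical record). */
    public record Table(String name, Set<String> columns,
                        Set<String> primaryKey, List<Set<String>> foreignKeys) {}

    public static DsdClass classify(Table t) {
        // count the columns that are covered by some foreign key
        long fkColumns = t.columns().stream()
                .filter(c -> t.foreignKeys().stream().anyMatch(fk -> fk.contains(c)))
                .count();

        // all columns belong to foreign keys and at least two keys exist:
        // likely a relationship relation (e.g. film_actor in Sakila)
        if (t.foreignKeys().size() >= 2 && fkColumns == t.columns().size()) {
            return DsdClass.REFERENCE_ASSOCIATION;
        }
        // the table has attributes of its own: likely a base relation (e.g. actor)
        if (fkColumns < t.columns().size()) {
            return DsdClass.CONCEPT;
        }
        // ambiguous key constellation: defer to the user (step 3)
        return DsdClass.NOT_DEFINED;
    }
}
```

Deferring the ambiguous cases rather than guessing mirrors the semi-automatic character of the approach: the heuristic only commits where the key constellation is unambiguous.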
3.2 Integration Process
The automatic establishment of a domain model requires an arbitrary number of data source descriptions as input parameters, given by a human user. This list also determines the interpretation order of the single ontologies. The first data source ontology is therefore considered the basis and is completely copied to the integration namespace. All other ontologies are stepwise integrated according to the process described in Section 2.3.

For each concept of a newly added ontology, the similarity to every existing concept in the domain ontology is calculated. At the most basic level, two attributes are compared by using the Jaccard coefficient J(a, b) = |a ∩ b| / |a ∪ b| over the bit sets a and b. These sets consist of the attribute's properties {dcterms:title, dsd:isNullable, dsd:isOfDataType, dsd:isAutoIncrement, dsd:isUnique}, where for each comparison of a property a true/false assertion is made. String data types are checked for sub-string occurrences; other data types, like integer, Boolean or XML data types, are verified according to their equality.

The similarity calculation between two concepts, associations, primary or foreign keys is also performed using the Jaccard coefficient, but those instances are represented by the set of their assigned attributes, combined with properties like dsd:isComplete and dsd:isDisjoint for associations. A threshold has to be defined manually in order to determine the minimum similarity value that still identifies two resources as equal. The threshold should be set to a value between 0.51 and 0.99, whereas data with lower quality requires a lower threshold.
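A minimal sketch of this comparison, assuming (as described above) that each resource is reduced to the set of property assertions that hold for it; the class and method names are invented, and details such as the sub-string check for string data types are left out:

```java
import java.util.HashSet;
import java.util.Set;

/** Jaccard similarity over property-assertion sets; a simplified sketch. */
public class DsdSimilarity {

    /** J(a, b) = |a ∩ b| / |a ∪ b|; defined as 1.0 for two empty sets. */
    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    /** Two resources are considered equal if the similarity exceeds the threshold. */
    public static boolean consideredEqual(Set<String> a, Set<String> b, double threshold) {
        return jaccard(a, b) > threshold;
    }
}
```

For example, comparing an attribute with the assertions {title=id, type=integer, nullable=false} to one with {title=id, type=integer, nullable=true} shares two of four distinct assertions and yields J = 2/4 = 0.5, below any admissible threshold.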
3.3 Evaluation
The implementation is evaluated by extracting the metadata of two MySQL databases (Sakila and Magento), transforming it into ontologies and calculating the similarity between resources of the data source ontologies. The selected examples serve as a basic proof of concept and do not claim to cover all possibly occurring problems. In-depth evaluations including larger databases (like ERP, CIM and IRM systems) are planned.

Sakila. Sakila (http://dev.mysql.com/doc/sakila/en/) is an official MySQL sample database for the administration of a film distribution. During the evaluation all tables were automatically assigned correctly to their corresponding classes in dsd; the most important ones are depicted in Table 2. This perfect result was achieved due to the simple database schema of Sakila.

Table 2: Automatic classification of Sakila

Table          Referenced tables        dsd class
actor          -                        Concept
address        → [city]                 Concept
category       -                        Concept
city           → [country]              Concept
country        -                        Concept
film           → [language, language]   Concept
film_actor     → [actor, film]          ReferenceAssociation
film_category  → [film, category]       ReferenceAssociation
language       -                        Concept

Magento. Magento (http://magento.com) is a popular open source software for e-commerce and uses MySQL for data storage. For the evaluation, parts of the Entity-Attribute-Value (EAV) system of database schema version 1.7.0.2 with higher complexity (compared to Sakila) were used in order to evaluate the description of inheritance relationships. Table 3 shows the automatic classification, clearly depicting the expected result, where two tables inheriting from eav_attribute could not be assigned to a class automatically. In this case user interaction is necessary in order to manually assign the tables to the corresponding classes.

Table 3: Automatic classification of Magento

Tables                    dsd class
catalog_category_entity   Concept
catalog_category_product  ReferenceAssociation
catalog_eav_attribute     not defined
catalog_product_entity    Concept
customer_eav_attribute    not defined
eav_attribute             Concept
eav_entity_type           Concept

Integration. For the evaluation of the integration procedure, two views of the Magento database with partially overlapping tables are extracted, represented as ontologies and finally integrated into a domain ontology. As expected, equal concepts are identified as equal, receiving a Jaccard coefficient value of 1.0. No mapping is detected for concepts that appear in only one of the two data models, and therefore they are added to the integrated domain ontology. The table eav_attribute was slightly modified by removing an attribute in one of the two views and therefore receives a similarity value of 0.9444. Because this value is greater than the threshold, which was set to 0.8, the table is correctly linked to the original eav_attribute by owl:sameAs.
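The reported value fits the concept-level measure from Section 3.2: if eav_attribute is represented by its attribute set and one of, say, 18 attributes is missing in one view (the exact attribute count is an assumption for this illustration, chosen because 17/18 ≈ 0.9444), the similarity exceeds the 0.8 threshold and the owl:sameAs link is established:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Reconstructs the concept-level similarity check from the Magento integration test. */
public class IntegrationCheck {

    public static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // hypothetical attribute sets: 18 attributes vs. the same set minus one
        Set<String> full = IntStream.rangeClosed(1, 18)
                .mapToObj(i -> "attr" + i).collect(Collectors.toSet());
        Set<String> reduced = new HashSet<>(full);
        reduced.remove("attr18");

        double sim = jaccard(full, reduced);   // 17/18 ≈ 0.9444
        boolean sameAs = sim > 0.8;            // threshold from the evaluation
        System.out.println(String.format(java.util.Locale.ROOT,
                "similarity = %.4f, link = %b", sim, sameAs));
    }
}
```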
4. RELATED WORK
In the last years, several initiatives have proposed ontologies as beneficial for information integration. Cruz and Xiao [1] advocate ontologies as well suited for data integration and mention hybrid ontologies as the most appropriate approach for Local-as-View (LaV) integration systems. The LaV approach permits modifications in the sources without affecting the global ontology [1], which is an essential requirement of the proposed approach.

SemWIQ [3] is a generic architecture for integrating different types of data sources, based on a mediator-wrapper system in combination with a static mapping influenced by the content level. In contrast, our approach aims at an automatic generation of ontologies, focusing on the concept level and on metadata extraction to increase the expressiveness of the represented model.

Particular attention is paid to relational databases and their reverse engineering, because they are still the most widely used systems for storing data. Although many attempts at reverse engineering an RDB have been proposed so far, a satisfying automatic solution is not available. Spanos, Stavrou and Mitrou [7] give an overview of techniques for semantically describing RDBs. Common methods like Relational.OWL (http://www.dbs.cs.uni-duesseldorf.de/RDF/relational.owl), R2RML (http://www.w3.org/TR/r2rml) or Tim Berners-Lee's basic mapping "table-to-class, column-to-predicate" to RDF graphs are not sufficient for the proposed approach, because they represent a simple 1:1 mapping without gaining additional knowledge about the data. For enhancing the expressiveness of the model without loss of quality, the semi-automatic reverse engineering process introduced by Euzenat et al. [2] is applied.

Several tools for calculating the similarity of ontologies are available, for example the Java library OntoSim (http://ontosim.gforge.inria.fr) or the Java-based GoodOD Similarity Evaluator (https://github.com/goodod/evaluator). Both tools focus on the comparison of ontology structure and classes, whereas our approach compares individuals. Due to the common dsd vocabulary, classes and their properties are represented equally for each ontology, which facilitates the integration process with additional domain knowledge. Based on the evaluation of similarity measures in [2] and the availability of detailed domain knowledge about the dsd vocabulary, the similarity calculation is performed using the Jaccard coefficient, similar to the work of Muthaiyah and Kerschberg [6].

5. CONCLUSIONS
The presented approach provides a semi-automatic semantic description of an information integration infrastructure, considering its heterogeneous and distributed information sources. The final result is provided in machine-readable form, ideally suited for subsequent information processing and data analytics. In practice, many data sources come with a simple schema without aggregations or inheritance hierarchies. Therefore, a description can be generated automatically in most cases, because the classes Concept and ReferenceAssociation are classified reliably, as shown by the evaluation use cases. The domain ontology, although dependent on data quality, provides a structured view with high semantic expressiveness on the entire integration infrastructure. This information is an important prerequisite for subsequent data processing and data analytics, e.g. redundancy detection, as well as for data quality assessment. In order to cope with similarities in structure, we will establish a repository for structure descriptions to define similarities as equivalencies, analogous to synonyms and homonyms defined in thesauri and vocabularies.

There is still potential for further improvements, especially in terms of reverse engineering of relational databases, e.g. considering the replacement of the substring similarity by a thesaurus and performing in-depth research on structural graph similarity. Data quality assessment was not part of this work, but is currently the focus of two follow-up research projects, one dealing with schema quality and the other with data quality at the tuple level.

6. REFERENCES
[1] I. F. Cruz and H. Xiao. The Role of Ontologies in Data Integration. Journal of Engineering Intelligent Systems, 13:245–252, 2005.
[2] J. Euzenat, C. Allocca, J. David, M. D'Aquin, C. Le Duc, and O. Sváb-Zamazal. Ontology distances for contextualisation. Contrat, INRIA, Aug. 2009. https://hal.inria.fr/hal-00793450 [August 2015].
[3] A. Langegger. A Flexible Architecture for Virtual Information Integration based on Semantic Web Concepts. PhD thesis, Johannes Kepler University Linz, 2010.
[4] M. Lenzerini. Data Integration: A Theoretical Perspective. In Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '02, pages 233–246, New York, NY, USA, 2002. ACM.
[5] L. Lubyte and S. Tessaris. Extracting Ontologies from Relational Databases. In Proc. of the 20th Int. Workshop on Description Logics (DL'07), 2007.
[6] S. Muthaiyah and L. Kerschberg. A Hybrid Ontology Mediation Approach for the Semantic Web. International Journal of E-Business Research (IJEBR), 4(4):79–91, 2008.
[7] D.-E. Spanos, P. Stavrou, and N. Mitrou. Bringing Relational Databases into the Semantic Web: A Survey. Semantic Web, 3(2):169–209, Apr. 2012.