PKBD.Onto: A Plugin for Ontological Schemas Generation Nikita Dorodnykh[0000-0001-7794-4462], Aleksandr Yurin[0000-0001-9089-5730] and Anastasia Vidiya Matrosov Institute for System Dynamics and Control Theory, Siberian Branch of Russian Academy of Sciences, Lermontov St. 134, Irkutsk, Russia tualatin32@mail.ru, iskander@icc.ru Abstract. The use of Semantic Web technologies (including ontologies) for in- telligent systems and knowledge bases engineering is a widespread practice, it is true especially for tasks of conceptualization and formalization. However, tools and approaches used for these tasks in most cases provide only a manual manipulation of concepts and relationships. In this regard, the use of various in- formation sources for automated ontology engineering is relevant. One of these sources is spreadsheets. In this paper, we propose an approach for the automat- ed creation of ontological schemas based on the analysis and transformation of spreadsheets data. The feature of our approach is the original relational canoni- calized form of spreadsheets. This form is used for preprocessing spreadsheets and unifying the input data. The proposed approach is implemented in the form of a plugin (PKBD.Onto) for Personal Knowledge Base Designer - software for prototyping rule-based expert systems. The main stages of the approach, the ar- chitecture and functions of the plugin, and the case study are also described. Keywords: Spreadsheets, Canonical Spreadsheet, Ontological Schema, OWL, Model Transformation, Code Generation 1 Introduction The use of Semantic Web technologies, including ontologies [8], for intelligent sys- tems and knowledge bases engineering is a widespread practice. In most cases, ontol- ogies and special software (e.g., Protégé, ONTOedit, Menthor Editor, Semaphore Ontology Editor, OntoStudio, WebOnto, Fluent Editor, etc.) are used by analysts and domain experts for tasks of knowledge conceptualization and formalization. However, these tools provide a weak integration with external information sources (e.g., data- bases, texts, tables, conceptual models, etc.) in terms of importing domain concepts and relationships. This fact reduces the efficiency of the ontology engineering pro- cess. One of the information sources that can be used for the automated creation of ontologies is spreadsheets. Today, a large volume of arbitrary tables has been accu- mulated worldwide [9] and presented in the spreadsheet-like formats (HTML, Copyright © 2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). EXCEL, and CSV). Arbitrary tables are a valuable data source in business intelli- gence and data-driven research. In our previous papers [5, 14] we proposed an approach for automated analysis and transformation of spreadsheets into conceptual domain models in the form of UML class diagrams. In this paper, we propose to apply this approach for ontological sche- mas generation (ontologies at the TBox level) in the OWL2 DL format [6]. A feature of the proposed approach is the use of the original canonicalized form for representa- tion of spreadsheets, which provides the unification of input data. Our approach is implemented in the form of the plugin, namely, PKBD.Onto, for Personal Knowledge Base Designer (PKBD) [15] – software for prototyping rule- based expert systems. A case study for the proposed approach and the plugin descrip- tion are also presented. 2 Background 2.1 Method for Spreadsheets Transformation A big volume of arbitrary tables (e.g. cross-tabulations, invoices, roadmaps, and data collection forms) circulates in spreadsheet-like formats. Spreadsheets are character- ized by a wide variety and heterogeneity of layouts, styles and content. This diversity determines the two groups of solutions for their processing:  ad-hoc solutions, which are oriented on certain layout or domain;  universal solutions that use pre-processing arbitrary spreadsheets to transform them into some unified canonicalized format for further automated processing [12, 13]. This work uses the results of the project for the development of a framework for creating systems of data extraction from arbitrary spreadsheets. It is a reason why we use the TabbyXL [11] canonical spreadsheet format designed for the analysis of spreadsheets from Industrial Safety Inspection (ISI) reports [14]. We use the following spreadsheet structure in a canonicalized form: CS CSV = {D, RH , CH }, (1) where D is a data block that describes literal data values (named “entries”) belonging to the same data type (e.g., numerical, textual, etc.); RH is a set of row labels of the category; CH is a set of column labels of the category. The values in cells for heading blocks can be separated by the “|” symbol to divide categories into subcategories. Thus, the canonical table denotes hierarchical relationships between categories (head- ings). The analysis of canonicalized spreadsheets is carried out line by line. At the same time, cells can contain several values (concepts) with the separator (“|”). A cell value with the separator (“|”) is interpreted as a hierarchy of classes (concepts) or attributes (properties). In our approach we use the following heuristic-based rules [14] for the transfor- mation of arbitrary spreadsheets to a canonicalized form: Rule 1: IF RH corresponds only one CH THEN RH transformed to a class with properties from CH. Rule 2: IF RH corresponds only one CH and at the same time RH contains two values with the separator (“|”) THEN RH transformed to a class with properties from CH and with an additional property "Name" that corresponds to RH-2. Rule 3: IF RH contains two values with the separator (“|”) and they correspond to two CH values with the separator (“|”), THEN RH transformed to the first class, CH transformed to the second class and a relationship stated between them. Rule 4: IF RH corresponds to three CH values with the separator (“|”), THEN RH transformed to the first class with properties CH-1, and CH 2 and 3 transformed to the second class and a relationship stated between them. All obtained parent-child relationships are interpreted as the association and the cardinality of “1..*” is determined by default. By default attribute values are set based on the D column. The main results of this algorithm are fragments of conceptual models. These fragments need to aggregate, including operations for clarifying the names of con- cepts, their properties and relationships, and also their possible merging and separa- tion. The following rules used for automatic aggregation of conceptual models frag- ments: Rule 1: Merge two classes when they have equal names from duplicate fragments of class diagrams. Rule 2: Merge two classes when they have the same structure, i.e. when sets of at- tributes are equal. In this case, only the first class with this structure stays in the mod- el. Rule 3: Merge two classes when they have similar names. The resulting fragments of class diagrams can describe the same objects or processes. We suggest using a simple string comparison method based on the Levenshtein distance [10] to determine the similarity between two names of classes. If the distance is less than or equal to three, then we assume the classes to be similar. Note that this is not enough, so we also look at the structure of classes (names of attributes must partly match). Rule 4: Create a new association between two classes if homonymous classes and attributes exist. In this case, a name in one class is equivalent to the attribute name in another class. At the same time, the attribute of the same name is removed. Rule 5: Remove duplicate associations between classes. Manual merging and separation operations are performed by using PKBD. 2.2 PKBD: a Tool for Knowledge Base Engineering We used PKBD when solving problems of knowledge bases of expert systems engi- neering, in particular, in the field of ISI [1]. PKBD is implemented as a desktop appli- cation designed for non-programmers. The main purpose of PKBD is to prototype knowledge bases that use the formalism of logical rules. One of PKBD features is a support of the Rule Visual Modeling Language (RVML) [7]. RVML is considered as a UML extension. Other PKBD features are the following: • a modular architecture that provides the ability to add modules for sup- porting knowledge programming languages. Currently, CLIPS and Drools are supported; • integrability with conceptual modeling tools when importing and export- ing concepts and relationships. The PKBD architecture determines the interaction of the following main software components: • a knowledge base management module, it provides storage of projects in the EKB format (the proprietary XML-like format); • a user interface subsystem includes the following modules: software wiz- ards for manipulating knowledge base elements, a GUI generation, a Tiny RVML editor; • a subsystem for supporting programming language modules, it provides connection and disconnection of modules, access to their functions for generating program codes; • a module of integration with conceptual models sources: IBM Rational Rose, StarUML, XMind, CMapTools, and TabbyXL; • a rule engines control module provides activation of rule engine for test- ing knowledge bases; • a module of interaction with the web-based software called Knowledge Base Development System (KBDS) [3]. Main functions of PKBD are: • designing elements of rule bases (fact templates, facts, and rules) by non- programmers using a set of wizards and defined sources of conceptual models; • checking the integrity of the developed knowledge bases (syntactic and semantic control); • representing knowledge base elements using RVML; • generating knowledge base codes in the CLIPS format; • testing developed knowledge base codes (logical inference) using the in- tegrated CLIPS rule engine; • integrating with CASE-tools: IBM Rational Rose, StarUML, XMind, and CMapTools, regarding import and transformation of conceptual models in order to highlight the main entities (concepts) and relationships for creat- ing knowledge base drafts; • integrating with TabbyXL [11] in terms of import and transformation of canonical spreadsheet tables in order to highlight the main entities (con- cepts) and relationships for creating knowledge base drafts; • interacting with the KBDS service. We used PKBD as an open software platform and developed a PKBD.Onto plugin. This plugin implements our approach for ontological schemas generation in the OWL2 DL format. 3 Proposed Approach 3.1 Method The method for generating ontological schemas is based on principles of a model transformation. A model transformation is one of the key concepts in Model-Driven Engineering (MDE) [2]. From a formal point of view our method can be represented as a chain of horizon- tal exogenous transformations: T : CS CSV → CM XML → OS OWL , (2) CSV where CS is a source spreadsheet presented in a canonicalized form and saved in CSV format using TabbyXL. The structure of a canonical spreadsheet is described XML in Section 2.1; CM is a conceptual model resulted from spreadsheet transfor- mation, which is a form for the internal representation of domain concepts and rela- OWL tionships for PKBD; OS is a target ontological schema in the OWL2 DL format. XML Using (2), let’s describe CM in more detail: XML CM = C , DT , RL , where C is a set of classes; DT is a set of datatypes; RL is a set of relationships between C . Let’s refine C from (3) as follows: C = {c1...cn }, ci = namei , ATi , i = 1, n , when namei is a class name; ATi is a { } set of class attributes, ATi = ai ,1 ,..., ai ,k , ai , j = name j , type j , value j , j ∈1, k , when name j is an attribute name; type j is an attribute datatype, type j ∈ DT ; value j is a possible attribute value. RL = {rl1...rln }, rli = namei , typei , lhsi , rhsi , i = 1, n , when typei is a relationship type (inheritance, dependency, association, aggregation, composition, realization); namei is a relationship name; lhsi is a left side of a relationship, lhsi = namelhs , cd lhs , c j , when namelhs is a name of a class role at the left relation- ship side, cd lhs is a cardinality of the left relationship side, c j is a link of a class at the left relationship side, c j ∈ C ; rhsi is a right side of a relationship, rhsi = namerhs , cardinalityrhs , ck , when namerhs is a name of a class role at the right relationship side, cd rhs is a cardinality of the right relationship side, ck is a link of a class at the right relationship side, ck ∈ C . Wherein, cdlhs , cd rhs = {0,0..1,0..*,1,1.. *} . OWL Using (2), let’s describe OS in more detail: OWL OS = C , OP, DP, DT , when C is a set of classes; OP is a set of object properties; DP is a set of datatype properties; DT is a set of XML Schema datatypes. A detailed description of the OWL 2 DL specification is given in [6]. CSV Analysis and transformation of source spreadsheets ( CS ) and formation of a XML conceptual model ( CM ) are discussed in detail in [5, 14]. In this paper, we will OWL describe in detail how to obtain ontological schemas ( OS ). For this, using (2), let’s describe a transformation operator ( T ): T = TCS − CM , TCM − OSM , TOSM − OS , TCS −CM : CS CSV → CM XML , TCM − OSM : CM XML → OSM , TOSM − OS : OSM → OS OWL , where TCS −CM is a set of rules for transformation of a source spreadsheet in the CSV format into a conceptual model, for example, a UML class diagram; TCM −OSM is a set of rules for transformation of a conceptual model into an ontological schema mod- el; TOSM −OS is a set of rules for transformation of an ontological schema model into OWL ontology code at the TBox level. Wherein: OSM is an ontological schema model designed for a unified representa- tion and storage of knowledge extracted from various information sources. This mod- el abstracts from features of knowledge representation languages and their dialects used for the implementation of ontologies (e.g., OWL, RDFS, etc.). So, using sets of transformation rules ( TCM −OSM and TOSM −OS ), ontological OWL schemas generation ( OS ) includes four main stages. Stage 1: Analysing and transforming an XML structure of PKBD internal knowledge representation for conceptual models. This stage involves extracting ele- ments, their properties, and relationships from an XML tree (the depth-first search for elements). Stage 2: Forming an ontological schema model. The main objective of this stage is obtaining typical ontological fragments in the form of a set of classes and their rela- tionships (object and datatype properties), which describe a certain domain and based on the extracted XML elements. Stage 3: Generating an ontological schema code in the OWL format based on an ontological schema model. Transformations themselves can be described using special transformation lan- guages, for example, Transformation Model Representation Language (TMRL) [4]. In this work, we use a general-purpose language to implement transformations. Moreo- ver, all transformations can be represented in tabular form (Table 1). Table 1. Main correspondences between elements of a conceptual model, an ontology schema model, and OWL constructions. CM OSM OWL Model Ontology owl:Ontology Class Class owl:Class Generalization (class) Class (superclass) rdfs:subClassOf Class (name) Class (name) rdf:about Association Relationship owl:ObjectProperty owl:ObjectProperty AssociationEnd (class) Rhs (rdfs:domain) owl:ObjectProperty AssociationEnd (class) Lhs (rdfs:range) Attribute Property owl:DatatypeProperty owl:DatatypeProperty Attribute (name) Property (name) (rdfs:domain) owl:DatatypeProperty Attribute (value) Property (value) (rdfs:range) Attribute (description) Property (description) rdfs:comment Stage 4: Editing an obtained ontological schema. This stage is additional and rep- resents a refinement (modification) of OWL code obtained with the aid of various ontological modeling tools, for example, Protégé and others. So, the main result of these stages is a set of ontology classes and their properties, which define an ontological schema at the TBox level. 3.2 PKBD.Onto: a Plugin for PKBD The PKBD.Onto plugin is implemented in the form of a Dynamic Link Library (DLL) that is dynamically connected via a unified PKBD API. The unified PKBD API for supporting integration modules with external software in terms of import and export contains three functions: • getting a description of DLL including name and version (“DllInfo” func- tion); • getting a detailed description of DLL (“About” function); • executing a main function of DLL, while a conceptual model in the PKBD format, a resulting file name, and a list of possible parameters are passed as a parameter (“Execute” function). In the PKBD.Onto plugin architecture (Fig. 1) can be distinguished following components: • supporting a PKBD format of conceptual models, which provides access and manipulation of model elements; • transforming the input model to the OWL2 DL format; • transforming the input model to a set of linked data in the RDF format (can be viewed as a mean for obtaining a set of specific facts). XML PKBD Parser DllInfo OWL DL Generator About RDF Generator Execute Fig. 1. A PKBD.Onto plugin architecture. 3.3 Case Study Currently, PKBD is used in the educational process at Irkutsk National Research Technical University (IrNRTU), Institute of Information Technology and Data Sci- ence. Therefore, as an example, let’s consider the educational task of developing an ontological schema fragment. Information on minerals in the form of arbitrary spreadsheets is used as source data (Fig. 2). To unify the input data, a source arbitrary spreadsheet was preprocessed and a canonical spreadsheet resulted (Fig. 3). Next, the canonical spreadsheet is analyzed using PKBD, in particular, by the PKBD.Onto plugin. Conceptual model elements are extracted as a result of this analy- sis. These elements can be visually represented as an RVML schema (Fig. 4). The obtained model requires modification, namely, all minerals were aggregated into a “Diamond” class (template), which must be renamed to “Mineral”. Fig. 2. An example of a source arbitrary spreadsheet (before preprocessing). Fig. 3. A fragment of a canonical spreadsheet. Fig. 4. A conceptual model in the form of a RVML schema resulted from the analysis of a canonical spreadsheet. Based on the modified conceptual model (Fig. 4), we generated the code of the onto- logical schema in the OWL format. Then, this code can be verified in Protégé (Fig. 5). Fig. 5. A fragment of the ontological schema (Protégé). 4 Conclusions In this paper, we describe a method and tool for ontological schemas generation (on- tologies at the TBox level) in the form of a plugin for Personal Knowledge Base De- signer. Spreadsheets reduced to a canonicalized form and saved in the CSV format were used as source data. Resulting OWL ontology codes are syntactically correct and can be evaluated by end-users. The PKBD.Onto plugin allows one to create rapid prototypes of spreadsheet-based ontologies for a specific domain. Modified and refined ontologies can be used for intelligent systems and knowledge bases engineering [1]. 5 Acknowledgments This work was financially supported by the Council for Grants of the President of Russia (grant No. MK-1647.2020.9), Program of the Fundamental Research of the Siberian Branch of the Russian Academy of Sciences, project no. IV.38.1.2 (reg. no. АААА-А17-117032210079-1), project no. IV.38.1.3 (reg. no. АААА-А17- 117032210077-7). Results are achieved using the Centre of collective usage «Inte- grated information network of Irkutsk scientific educational complex». References 1. Berman, A.F., Nikolaichuk, O.A., Yurin, A.Yu., Kuznetsov, K.A.: Support of Decision- Making Based on a Production Approach in the Performance of an Industrial Safety Re- view. Chemical and Petroleum Engineering 50(1-2), 730–738 (2015). DOI: 10.1007/s10556-015-9970-x 2. Da Silva, A.R.: Model-driven engineering: A survey supported by the unified conceptual model. Computer Languages, Systems & Structures 43, 139–155 (2015). DOI: 10.1016/j.cl.2015.06.001 3. Dorodnykh, N.O.: Web-based software for automating development of knowledge bases on the basis of transformation of conceptual models. Open Semantic Technologies for In- telligent Systems 1, 145–150 (2017). 4. Dorodnykh, N.O., Yurin, A.Yu.: A domain-specific language for transformation models. CEUR Workshop Proceedings (ITAMS-2018) 2221, 70–75 (2018). 5. Dorodnykh, N.O., Yurin, A.Yu., Shigarov, A.O.: Conceptual Model Engineering for In- dustrial Safety Inspection Based on Spreadsheet Data Analysis // Communications in Computer and Information Science. Modelling and Development of Intelligent Systems (MDIS 2019) 1126, 51–65 (2020). DOI: 10.1007/978-3-030-39237-6_4 6. Grau, B.C., Horrocks, I., Motik, B., Parsia, B., Patel-Schneider, P., Sattler, U.: OWL 2: The next step for OWL. Web Semantics: Science, Services and Agents on the World Wide Web 6(4), 309–322 (2008). DOI: 10.1016/j.websem.2008.05.001 7. Grishenko, M.A. Dorodnykh, N.O., Nikolaychuk, O.A., Yurin, A.Yu.: Designing rule- based expert systems with the aid of the model-driven development approach. Expert Sys- tems 35(5), 1–23 (2018). DOI: 10.1111/exsy.12291 8. Guarino, N.: Formal Ontology in Information Systems. In: the First International Confer- ence on Formal Ontology in Information Systems (FOIS’98) 46, 3–15 (1998). 9. Lehmberg, O., Ritze, D., Meusel, R., Bizer, C.: A large public corpus of web tables con- taining time and context metadata. Proceedings 25th International Conference Companion on World Wide Web, 75–76 (2016). DOI: 10.1145/2872518.2889386 10. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Tech. Rep. 8, Soviet Physics Doklady (1966). 11. Shigarov, A.O., Khristyuk, V.V., Mikhailov, A.M.: TabbyXL: Software platform for rule- based spreadsheet data extraction and transformation. SoftwareX 10, 100270 (2019). DOI: 10.1016/j.softx.2019.100270 12. Shigarov, A.O., Mikhailov, A.A.: Rule-based spreadsheet data transformation from arbi- trary to relational tables. Information Systems 71, 123–136 (2017). https://doi.org/10.1016/j.is.2017.08.004 13. Tijerino, Y.A., Embley, D.W., Lonsdale, D.W., Ding, Y., Nagy, G.: Towards Ontology Generation from Tables. World Wide Web 8(3), 261–285 (2005). DOI: 10.1007/s11280- 005-0360-8 14. Yurin, A.Yu., Dorodnykh, N.O.: A Reverse Engineering Process for Inferring Conceptual Models from Canonicalized Tables. Proceedings of the 2019 International Multi- Conference on Engineering, Computer and Information Sciences (SIBIRCON) 485–490 (2020). DOI: 10.1109/SIBIRCON48586.2019.8958458 15. Yurin, A.Yu., Dorodnykh, N.O.: Personal knowledge base designer: Software for expert systems prototyping. SoftwareX 11, 100411 (2020). DOI: 10.1016/j.softx.2020.100411