Ontology Search by Categorization Power Ondřej Zamazal and Vojtěch Svátek Department of Information and Knowledge Engineering, University of Economics, W. Churchill Sq.4, 130 67 Prague 3, Czech Republic {ondrej.zamazal|svatek}@vse.cz Abstract. The demo paper presents novel functionality of ontology search based on categorization power as extension to the OOSP tool. We explain the approach and introduce two scenarios supported. 1 Introduction and Motivation When searching for a suitable ontology, knowledge engineers commonly use key- word search over different ontology entities such as classes, properties and in- dividuals.1 This approach generally yields ontologies that are to some degree relevant to the submitted terms. However, there are also finer-grained cases of search. Ontologies are often used for entity categorization, where instances known to belong to a general class are to be partitioned to its finer subcate- gories. For example, an individual John known to be a Person can be further categorized as a Man, or as someone who has a wife (via the compound concept expression hasWife some owl:Thing), provided the respective entities are avail- able in the ontology. We thus might be interested not only in ontologies having a class named Person but specifically in ontologies having Person as focus class: one for which such (preferably, rich) subcategorization can be expressed using the ontology’s entities. The search should then reflect the categorization power of the ontology wrt. particular focus classes such as Person, expressed in terms of the number of available categorization options (COs). While a formal model underpinning the notion of categorization power is forthcoming, this paper presents, in intuitive terms, preliminary support for the mentioned kind of search as implemented in the extension of the pre-existing ‘Online Ontology Set Picker’ (OOSP)2 tool [5]. We demonstrate its functionality on two scenarios of recommending ontologies based on their focus classes: – Sc1: For a given keyword, focus classes are found (by localname) and lists of their COs are shown for each ontology where they appear. This function- ality supports the collection of explicit relevance feedback in order to gather training data for future recognition of ‘meanigful’ COs. – Sc2: The overall number (focus) classes with number of COs reaching a cer- tain threshold is calculated per ontology, together with other ontology metrics such as number of classes, taxonomy depth etc.; optionally, the number of COs for a given focus class can be computed as well. 1 See, e.g., the Watson engine, http://watson.kmi.open.ac.uk/. 2 http://owl.vse.cz:8080/OOSP/ In the remainder, Section 2 briefly explains the main notions; Section 3 in- troduces the new functionality of OOSP; Section 4 describes related work, and, finally, Section 5 wraps up the paper with conclusions and future work. 2 Categorization Power of Focus Classes Most obvious categorization options (COs) in ontologies are named subclasses of the given focus class (FC). However, further COs may be constructed as com- pound concept expressions. Since the number of all COs under standard OWL expressiveness would often be infinite (e.g., considering arbitrary cardinality or unlimited recursion of properties), we currently restrict the scope of COs under consideration to those built using a few simple, common concept expression pat- terns, yielding a finite number of COs for every FC. In addition, only some COs are meaningful, reusable categories, which we denote as ontologistic3 categories (OCs). While the OC status of COs may sometimes be context-dependent or subjective to some degree, we believe that it is worth accounting for. For ex- ample, let us consider4 an ABox consisting of one class instantiation and three property assertions: Individual: John Types: Man Facts: bornIn UK Facts: hasWife Mary Facts: insuranceCategory Enterpreneur Let us examine the candidate COs for the Person FC built using different struc- tural patterns with respect to being OCs. The named class Man (P1 pattern) is clearly an OC. The compound expression hasWife value Mary, i.e. having Mary (a specific person) as wife, is not an OC due to its low reusability; on the other hand, having a wife in general (hasWife some owl:Thing, P2 pattern) or, possibly, having a wife who is doctor (P3 pattern) is an OC. Being born in the UK (i.e. a specific country, P4 pattern) would be an OC in many contexts; being born in general (P2) should not (since it should be a functional property holding for every instance of Person. Finally, insuranceCategory with a spe- cific value (again, P4 pattern) is clearly an OC; in contrast, insuranceCategory some owl:Thing (P2) may only be an OC under the assumption that being in- sured is not mandatory in the given context and that the fact of non-insurance is manifested by the absence of any insuranceCategory assertion. In this preliminary work we attempt to approximate the categorization power for focus classes as computed from the ontology TBox (focused categorization power, FCP). We selected five variants of CO construction, of which four (1–4) simply follow the patterns P1 to P4, while the additional variant 5 is derived from P4 but the individual i is not part of the ontology itself but of an associated SKOS5 codelist. For each variant we consider a specific formulation of the FCP 3 While ‘ontological category’ might read as ‘category expressible using an ontology’ (possibly with a complex, unintuitive descriptions), ‘ontologistic category’ refers to presumed plausibility of the CO (as reusable domain concept) for human ontologists. 4 In Manchester OWL syntax, http://www.w3.org/TR/owl2-manchester-syntax/. 5 https://www.w3.org/2004/02/skos/ V.(P.) f cpvar (FC) 1 (P1) |{C; C SubClassOf : F C}| 2 (P2) |{P ; P Domain : F C}| 3 (P3) |{(P, C); ∃D P Domain : F C ∧ P Rangea D ∧ C SubClassOf : D}| 4 (P4) |{(P, i); ∃C, D P Domain : F C ∧ P Rangea D ∧ C SubClassOf : D ∧ i T ypes : C| 5 (P4) |{(P, i); ∃s P Domain : F C ∧ P Rangea skos : Concept ∧ ∧ P Rangea value(skos : inScheme, s) ∧ i skos : inScheme s ∧ i T ypes : skos : Concept| Table 1. Detection formulas of COs. V. refers to variant and P. refers to pattern. measure at the level of the whole ontology, denoted as f cp1 to f cp5 , Table 1. Note that these FCP are only (often quite imprecise) approximations of ‘ground truth’ FCP that would only take into account ‘true’ OCs approved by some consensus of human ontologists. We also prune candidate COs whose ineligibility as OC follows from the ontology structure, e.g., for pattern 2 we prune those properties P that appear in an existential restriction F C v ∃P.C; for such properties the CO would contain all instances of the FC.6 For detection we always use the inferential closure of the ontology, e.g., in order to get inferred domain or inferred subclasses, however, with the exception of the range axiom, which is only considered as asserted (therefore the ‘a’ subscript) – otherwise not only subclasses of D but also its superclasses would be inferred as range of P . Because we use inference we employed OWLAPI framework instead of using SPARQL. 3 Implementation in OOSP Currently, ontology search based on focus classes is available for three ontology collections, referred to as ‘pools’ in OOSP. LOV 7 is a well-curated collection of linked open vocabularies used in the Linked Data Cloud. Out of the 529 ontolo- gies (Jan. 2016 snapshot) 1 was not parseable by OWL-API and 19 ontologies were not processable due to unavailable imports. In all, our Jan. 2016 snapshot contains 509 LOV ontologies (96%). NanJing Vocabulary Repository 8 (NJVR) is a vocabulary repository extracted from the index of the Falcons search en- gine.9 The latest release is from June 2015 and it consists of 1763 vocabularies from which 135 were not parseable by OWL-API or were not processable due to unavailable imports. In all, our Jan. 2016 snapshot contains 1628 vocabularies divided into Nanjing vocabularies extracted from single files (1403) and Nanjing merged vocabularies extracted from more than one RDF file (225).10 Finally, 6 We plan to present a rigorous and evaluated framework for FCP in the future, which will benefit from the usage of this application. For more explanation about FCP and its computation based on variants 1-5 see http://owl.vse.cz:8080/ SumPre2016-FCP.pdf. 7 http://lov.okfn.org/dataset/lov/ 8 http://ws.nju.edu.cn/njvr/ 9 http://ws.nju.edu.cn/falcons/objectsearch/ 10 The latter are experimental ontologies which were created as a merge of their defi- nitions spreading over RDF files. we included the OntoFarm ontology collection, which includes 16 small but rel- atively rich ontologies from the conference organization domain. The collection has previously been used for experiments in Ontology Matching and elsewhere.11 Sc1, Ontology Search Based on Focus Classes,12 is supported in a four-step workflow. First, an initial ontology pool is selected and a keyword for FC search is provided by the user; the keyword can be searched as a whole localname or as a part of it. Second, the user obtains ontologies divided into two tables, the first providing the ontologies containing FCs according to the input keyword, and the second providing ontologies with classes (non-FCs) matching the keyword but having no COs. For both tables OOSP also provides ontology metrics values. Showing not only FCs but also non-FCs can be of interest especially in single- domain collections. For example, although 12 ontologies from OntoFarm contain the review concept, only 5 allow to categorize reviews according to one of the patterns. (The highest number of COs is attributed to the ekaw ontology where there are 23 COs out of which 18 COs is related to variant/pattern 3.) Third, for each ontology for which some FC was found the user can check which FCs were discovered and with which COs. For example, in the ekaw ontology reviews can be categorized, in variant 3, either according to the type of paper being reviewed (regular paper, workshop paper etc.) or by the role of person who wrote the review (workshop chair, session chair etc.). For large ontologies with many relevant COs, the user can ask for a random sample of size between 10 and 100 COs. The fourth, optional step consists in providing feedback to the system on which COs are not proper OCs: it is going to be used for training a classifier. An assumption behind Sc1 is that if an important term of a domain is in the role of FC in an ontology then this ontology covers an important part of this domain overall. For example, searching in LOV ontology pool for FCs based on the ‘university’ keyword retrieves three ontologies with FCs and five ontologies with non-FCs. While in the former group all are from the university/education domain, the latter group only contains 60% of ontologies from this domain. Sc2, Ontology Search Based on Overall Categorization Power,13 is supported in a four-step workflow originally introduced for OOSP in [5]. After the ontology pool selection the user can filter the ontologies not only according to axiom- based, taxonomy-based, and similar ontology metrics, but also according to the number of COs the ontology provides and/or FCs an ontology contains. For example, let us consider the user wants to find ontologies providing a relatively large ABox and at the same time a large proportion of their classes can be categorized. The user sets the ‘number of instances’ metric to ‘at least 100’ and the ratio of (any type of) FCs to all classes to ‘at least 0.8’. In LOV (01- 2016) there are 10 ontologies having both these characteristics, while there are 55 ontologies having more than 100 instances and 176 ontologies with FC ratio higher than 0.8. 11 http://owl.vse.cz:8080/ontofarm/ 12 In OOSP, it is available after ‘Go to Ontology Search Based on Focus Classes’ option. 13 In OOSP, after ‘Go to Metrics Selection’ and an ontology pool selection, it is available as ’Metrics Based On Focus Classes’ option on ’Entity’ metrics page. 4 Related Work The categorization power of ontologies has not been, to our knowledge, studied with the flavor presented here. However, our approach can be compared to ex- isting ontology search tools. Users of Watson can search by keywords using a number of parameters: entity types, match level and scope of keywords [1]. Ex- tensive search is provided by the LOV portal [4] in terms of metadata, ontology and terms. Besides full-text search it is possible to narrow the search by filtering on term type (class, property, datatype or instance), language, etc. Other ap- proaches focus mainly on term search, e.g. vocab.cc 14 [2], LOVR Framework [3]. Since OOSP is primarily ontology-oriented, we do not foresee term-level search, however we plan to provide ontology full-text search similar as in Watson. 5 Conclusions and Future Work The demo introduces ontology search based on categorization power as an ex- tension to the OOSP tool. The presented scenarios are based on the assumption that discovery of FCs and their categorization power can support ontology search since categorization is an often (implicitly) required feature. Currently we only discover FCs and their categorization power based on the ontology TBox, how- ever, when available, we also plan to process the ABox. Property value pairs assigned to different instances can be used to automatic discovery of COs. We will also consider more variants of CO construction in future. While we currently offer ontology search based on a priori FCs, we also plan to deal with a posteriori FCs, in the sense of clustering the selected ontologies according to their shared FCs. Sharing FCs may indicate that the ontologies are close to each other. Acknowledgement Ondřej Zamazal has been supported by the CSF grant no. 14- 14076P. References 1. d’Aquin M., Sabou M., Motta E., Angeletou S., Gridinoc L., Lopez V., Zablith F. What can be done with the Semantic Web? An Overview of Watson-based Appli- cations. In: 5th Workshop on Semantic Web Applications and Perspectives, SWAP 2008, Rome, Italy. 2. Stadtmüller S. Harth A. Grobelnik M. Accessing information about linked data vocabularies with vocab.cc. In: Semantic Web and Web Science. 2013. Springer. 3. Stavrakantonakis I., Fensel A., Fensel D. Linked Open Vocabulary Recommenda- tion Based on Ranking and Linked Open Data. In: Joint International Semantic Technology Conference. 2015. Springer. 4. Vandenbussche P. Y., Atemezing G. A., Poveda-Villalón M., Vatant B. Linked Open Vocabularies (LOV): a gateway to reusable semantic vocabularies on the Web. Se- mantic Web Journal. 2015. 5. Zamazal O., Svátek V.: OOSP: Ontological Benchmarks Made on the Fly. In: Work- shop SumPre’15 at ESWC 2015. 14 http://vocab.cc/