Towards Combining Machine Learning with Attribute Exploration for Ontology Refinement

Jedrzej Potoniec1, Sebastian Rudolph2, and Agnieszka Lawrynowicz1

1 Institute of Computing Science, Poznan University of Technology, Poland
{jpotoniec,alawrynowicz}@cs.put.poznan.pl
2 Technische Universität Dresden, Germany
sebastian.rudolph@tu-dresden.de

Abstract. We propose a new method for knowledge acquisition and ontology refinement for the Semantic Web utilizing Linked Data available through remote SPARQL endpoints. The method is based on a combination of the attribute exploration algorithm from formal concept analysis and the active learning approach from machine learning.

1 Introduction

Knowledge acquisition is the process of capturing knowledge, typically from a human expert, and thus it concerns all systems and environments where such knowledge is required. It is also said to be a major bottleneck in the development of intelligent systems, due to its difficulty and time requirements. Of course the Semantic Web, as an area concerned with structured and precise representation of information, has to deal with exactly the same issue.

Since the early days of the Semantic Web, building ontologies has been a difficult and laborious task. People trying to express complex knowledge frequently do not know how to perform this task properly; mistakes arise from the difficulty of understanding the complex logical formalism underlying OWL.

Typically, an ontology engineer starts by collecting vocabulary and requirements for an ontology, then structures the vocabulary, and only later specifies more complex dependencies [6]. We propose a solution to support knowledge acquisition for ontology construction. In particular, we address the last part of the process, where some basic knowledge has already been gathered and more complex dependencies are to be specified.
We aim to answer the question: how can an ontology be extended with meaningful, valid and non-trivial axioms, taking into consideration the available data and the user's workload?

2 Related work

Many approaches to knowledge acquisition for ontology development have been proposed so far. The most basic ones are ontology editors supporting ontology development, such as Protégé3. In addition, there are methodologies helpful in ontology development, such as the one proposed in NeOn [6].

3 http://protege.stanford.edu/

In [1,5], applications of the attribute exploration algorithm from formal concept analysis to ontology development have been proposed. [1] describes how to discover subsumptions between conjunctions of classes, and [5] extends this to properties' domains and ranges.

In [3], the idea of learning ontologies purely from Linked Data by means of discovering association rules is presented. [2] presents a methodology for manually building and populating domain ontologies from Linked Data.

3 Approach

The proposed approach is to support the user during attribute exploration by means of machine learning (ML). The ML algorithm's task is to answer the simple, uninteresting questions posed by the attribute exploration algorithm and to leave for the user only those questions which are non-trivial to answer.

The input of our proposed algorithm is an ontology O, a partial context derived from it, and two thresholds θa and θr. These are, respectively, the thresholds for accepting and rejecting an implication, and they have to be chosen manually w.r.t. the ML algorithm used. The result of the algorithm is a set of probably valid implications, which can be transformed into subsumptions for extending the ontology. A detailed description of the algorithm is presented below. For the sake of clarity, the description treats the attribute exploration algorithm as a black box which provides the next implication to consider.

1. Generate an implication L → R by means of the attribute exploration algorithm.
2.
For every r ∈ R, do the following sequence of steps:
(a) If L → {r} is already refuted by some of the known individuals, go to the next r.
(b) If O |= L ⊑ r, remember the implication L → {r} as a valid one and go to the next r.
(c) Compute the probabilities of acceptance pa and rejection pr of the implication L → {r} with the ML algorithm. Note that pa + pr = 1.
(d) If pa ≥ θa, remember the implication L → {r} as a valid one and go to the next r.
(e) If pr ≥ θr, go to step 2i.
(f) Ask the user whether the implication L → {r} is valid.
(g) Add the considered implication, together with the user's answer, to the set of learning examples for the ML algorithm.
(h) If the implication is valid, remember it as a valid one and go to the next r.
(i) Otherwise, extend the partial context with a counterexample, either provided by the user or auto-generated.

The purpose of iterating through the set of conclusions R in the algorithm is twofold. First, we believe that this way the user can more easily decide whether the presented implication is valid, because she does not have to consider a complex relation between two conjunctions of attributes.

Second, this way the automated generation of counterexamples yields more concrete results. For an arbitrary implication L → R, a counterexample can be generated and said to have all attributes from L and to lack at least one attribute from R. This is not in line with the method of partial context induction, as it is unclear which attribute from R exactly the counterexample lacks. Because of that, the partial context can no longer reflect the knowledge base accurately, and the attribute exploration algorithm can start to generate invalid implications. If the implication has a single attribute in its right-hand side, it is clear which attribute the counterexample lacks.

3.1 Application of machine learning

The task which the ML algorithm is to solve can be seen as a kind of active learning with binary classification.
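The main loop above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the attribute exploration engine, the reasoner, and the classifier are stubbed behind hypothetical callables (next_implication, entails, refuted, classify, and so on), and all names are our own.

```python
# Sketch of steps 1-2 of the algorithm. All callables are hypothetical
# stand-ins for the attribute exploration engine, the reasoner, and the
# ML classifier described in the text.

def refine(next_implication, entails, refuted, classify, ask_user,
           add_example, add_counterexample, theta_a, theta_r):
    """Process implications L -> R produced by attribute exploration."""
    valid = []
    for left, right in next_implication():       # step 1: next implication
        for r in right:                          # step 2: each conclusion separately
            imp = (left, frozenset({r}))
            if refuted(imp):                     # (a) refuted by a known individual
                continue
            if entails(imp):                     # (b) already follows from O
                valid.append(imp)
                continue
            p_accept, p_reject = classify(imp)   # (c) p_accept + p_reject = 1
            if p_accept >= theta_a:              # (d) accept without asking
                valid.append(imp)
                continue
            if p_reject >= theta_r:              # (e) reject without asking
                add_counterexample(imp)          # (i) extend the partial context
                continue
            answer = ask_user(imp)               # (f) classifier is unsure
            add_example(imp, answer)             # (g) new learning example
            if answer:                           # (h) user confirmed validity
                valid.append(imp)
            else:                                # (i) user rejected it
                add_counterexample(imp)
    return valid
```

Note how steps (d) and (e) let the classifier short-circuit the interaction, so the user is only consulted in the uncertain middle band between θa and θr.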
Every implication is classified as valid or invalid, and if the algorithm is unsure, the user is asked. One should note that not every classifier generates reasonable probabilities; for example, rule-based or tree-based systems are usually not suitable for this purpose. The problem of generating probabilities can also be seen as a regression problem. Moreover, the costs of the two types of mistakes are different, and the distribution of learning examples can be heavily imbalanced, i.e. implications with one decision may appear much more often than those with the other. To reflect this fact, a classifier suitable for cost-sensitive learning is required.

To apply machine learning techniques, a way to transform implications into feature vectors is required. We apply three approaches to this problem. First, a single purely syntactic measure is used: the number of attributes in the left-hand side divided by the number of all attributes. Second, there are features made of values of measures typical for association rule mining. Their computation is based on the features of individuals in the ontology. Following the naming convention from [4], we use coverage, prevalence, support, recall and lift. Finally, we use a mapping from the set of attributes to Linked Data in order to obtain the number of objects in an RDF repository supporting an implication or its parts. Every attribute is mapped to a SPARQL graph pattern with a single featured variable denoting the object identifier. Following the same naming convention from [4], coverage, prevalence, support, recall and confidence are used. All of these features can be computed using only SPARQL COUNT DISTINCT expressions and basic graph patterns, and thus they maintain relatively low complexity and are suitable for use with remote SPARQL endpoints.

Such a feature vector is later labeled by the classifier mentioned above, and the given answer (valid/invalid/unsure) is used to either refine the ontology or ask the user.
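The feature construction can be sketched as follows, assuming the counts (total objects, objects matching L, matching R, matching both) have already been obtained, e.g. via SPARQL COUNT DISTINCT queries over the mapped graph patterns. The measures follow their standard probability estimates; all function and parameter names are our own illustration.

```python
# Sketch of turning an implication L -> R into a feature vector.
# n    = total number of objects
# n_l  = objects matching the pattern for L
# n_r  = objects matching the pattern for R
# n_lr = objects matching both

def rule_measures(n, n_l, n_r, n_lr):
    """Standard association-rule measures for L -> R."""
    coverage = n_l / n                  # P(L)
    prevalence = n_r / n                # P(R)
    support = n_lr / n                  # P(L and R)
    recall = n_lr / n_r                 # P(L | R)
    lift = (n * n_lr) / (n_l * n_r)     # P(L and R) / (P(L) * P(R))
    confidence = n_lr / n_l             # P(R | L)
    return [coverage, prevalence, support, recall, lift, confidence]

def feature_vector(left, all_attrs, counts_onto, counts_ld):
    """One syntactic feature, then measures over the ontology's
    individuals, then measures over the Linked Data counts."""
    syntactic = [len(left) / len(all_attrs)]   # |L| / |all attributes|
    return (syntactic
            + rule_measures(*counts_onto)
            + rule_measures(*counts_ld))
```

For simplicity this sketch computes both lift and confidence for each source, whereas the text uses lift for the ontology-based features and confidence for the Linked Data ones; restricting the output accordingly is straightforward.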
If the user is asked, her answer is then used as the correct label for the feature vector, and the classifier is retrained.

4 Conclusions and future work

As we are proposing a method intended to make the development and refinement of domain-specific ontologies easier, our main goal for evaluation is to validate its practical usability. We plan to apply our method to a selection of domain-specific ontologies concerning knowledge of a general type, such as literature, music and movies. We plan to use a crowdsourcing service to validate our hypotheses. We hope that with ontologies whose themes are general enough, and with additional information available on the Internet, the crowd will be able to validate our decisions about implications and Linked Data mappings.

We believe that our approach is promising and will be able to help ontology engineers in the process of ontology refinement. We are combining three technologies very well suited to this kind of task. First, the attribute exploration algorithm has been developed especially for discovering additional relations between attributes. Moreover, Linked Data is supposed to describe parts of the world; obviously, this description cannot be assumed to be either accurate or complete, yet it should be sufficient to support the user in the process of ontology refinement. Finally, the whole purpose of machine learning algorithms is to adapt themselves, and thus they are suitable to replace the user in uniform, repeatable tasks.

Acknowledgement. Jedrzej Potoniec and Agnieszka Lawrynowicz acknowledge support from the PARENT-BRIDGE program of the Foundation for Polish Science, co-financed from the European Union Regional Development Fund (Grant No POMOST/2013-7/8 LeoLOD – Learning and Evolving Ontologies from Linked Open Data).

References

1. Baader, F., Ganter, B., et al.: Completing description logic knowledge bases using formal concept analysis. In: Proc. of IJCAI 2007. pp. 230–235. AAAI Press (2007)
2.
Dastgheib, S., Mesbah, A., Kochut, K.: mOntage: Building Domain Ontologies from Linked Open Data. In: IEEE Seventh International Conference on Semantic Computing (ICSC). pp. 70–77. IEEE (2013)
3. Fleischhacker, D., Völker, J.: Inductive learning of disjointness axioms. In: Meersman, R., Dillon, T., et al. (eds.) On the Move to Meaningful Internet Systems: OTM 2011, LNCS, vol. 7045, pp. 680–697. Springer Berlin Heidelberg (2011)
4. Le Bras, Y., Lenca, P., Lallich, S.: Optimonotone measures for optimal rule discovery. Computational Intelligence 28(4), 475–504 (2012)
5. Rudolph, S.: Acquiring generalized domain-range restrictions. In: Medina, R., Obiedkov, S. (eds.) Formal Concept Analysis, LNCS, vol. 4933, pp. 32–45. Springer Berlin Heidelberg (2008)
6. Suárez-Figueroa, M.C., Gómez-Pérez, A., Fernández-López, M.: The NeOn Methodology for Ontology Engineering. In: Suárez-Figueroa, M.C., Gómez-Pérez, A., et al. (eds.) Ontology Engineering in a Networked World, pp. 9–34. Springer Berlin Heidelberg (2012)