Contextual Itemset Mining in DBpedia Julien Rabatel1 , Madalina Croitoru1 , Dino Ienco2 , Pascal Poncelet1 1 LIRMM, University Montpellier 2, France 2 IRSTEA UMR TETIS, LIRMM, France Abstract. In this paper we show the potential of contextual itemset mining in the context of Linked Open Data. Contextual itemset min- ing extracts frequent associations among items considering background information. In the case of Linked Open Data, the background informa- tion is represented by an Ontology defined over the data. Each resulting itemset is specific to a particular context and contexts can be related each others following the ontological structure. We use contextual mining on DBpedia data and show how the use of contextual information can refine the itemsets obtained by the knowledge discovery process. 1 Introduction We place ourselves in a knowledge discovery setting where we are interested in mining frequent itemsets from a RDF knowledge base. Our approach takes into account contextual data about the itemsets that can impact on what itemsets are found frequent depending on their context [16]. This paper presents a proof of concept and shows the potential advantage of contextual itemset mining in the Linked Open Data (LOD) setting, with respect to other approaches in the literature that do not consider contextual information when mining LOD data. We make the work hypothesis that the context we consider in this paper is the class type (hypothesis justified by practical interests of such consideration such as data integration, alignment, key discovery, etc.). This work hypothesis can be lifted and explored according to other contexts such as predicates, pairs of subjects and objects, etc. [1] as further discussed in Section 5. Here we are not interested in the use of how the mined itemsets can be relevant for ontological rule discovery [15], knowledge base compression [12] etc. We acknowledge these approaches and plan to investigate how, depending on the way contexts are considered, we can mine different kind of frequent itemsets that could be further used for reasoning. Therefore, against the state of the art our contribution is: – Taking into account contextual information when mining frequent itemsets on the Linked Open Data cloud. This allows us to refine the kind of infor- mation the mining process can provide. – Introducing the notion of frequent contextual pattern and show how it can be exploited for algorithmic considerations. We evaluate our approach on the DBpedia [13] dataset. In Section 2 we give a short example of the intuition of our approach. Section 3 explains the theoretical foundations of our work. In Section 4 we briefly present the DBpedia dataset and explain the obtained results. Section 5 concludes the paper. 2 Paper in a Nutshell A semantic knowledge base is typically composed of two parts. The first part is the ontological, general knowledge about the world. Depending on the subset of first order logic used to express it this part is also called TBox (in Description Logics [4]), support (in Conceptual Graphs [8]) or rules (in Conceptual Graphs and rule based languages such as Datalog and Datalog+- [6]). The second part is the factual knowledge about the data defining how the instances are in relation with each other. In Description Logics the factual knowl- edge is called ABox. Usually in the Linked Open Data, the factual knowledge is stored using RDF (usually in a RDF Triple Store) as triples “Subject Predicate Object”. Recently, within the Ontology Based Data Access [7] the data (factual information) can also be stored in a relational databases. A study of the trade-off of using different storage systems (with equivalent expressivity) was recently done in [5]. DBpedia organises the data in three parts: – The first part is the ontology (representing the TBox). The rules do not introduce existential variables in the conclusion (unlike existential rules as in Datalog+-) and they represent the Class Type hierarchy and the Predicate Type hierarchy. The ontology is guaranteed to be acyclic in the version of DBpedia we used. In the Directed Acyclic Graph (DAG) representing the ontology there are around 500 nodes, with around 400 being leaves. The maximal height is inferior to 10. The ontology we consider in the example in this section is depicted in Figure 1(b). We consider a six class ontology represented by a binary tree with height three. The algorithmic implications of the structure of the ontology are discussed in Section 5. Additionally, we consider the binary predicates “playsWith”, “eats” and “hates” of signature “(Animal, Animal)”. – The second part is the mapping based types containing the instance type definition. In the example in this section, using the ontology defined in Fig- ure 1(b), we consider the following mapping types (using a RDF triple no- tation): “Bill hasType Dog”, “Boule hasType Human”, “Tom hasType Cat”, “Garfield hasType Cat” and “Tweety hasType Bird ”. – The third part consists of the mapping based properties that correspond to the factual information. In the example here we consider the following facts: “Bill eats Tweety”, “Tweety hates Bill ”, “Bill playsWith Boule”, “Tom eats Tweety”, “Tom hates Bill ”, “Garfield hates Bill ”, “Garfield eats Tweety”, “Garfield hates Boule”. These facts are summarized in Figure 1(a). In this paper we make the choice of working with the data from the perspec- tive of the Class of the Subject. As mentioned in the introduction we motivate Subject Predicate Object Animal Bill eats Tweety Tweety hates Bill Bill playsWith Boule Tom eats Tweety NiceAnimal NastyAnimal Tom hates Bill Garfield hates Bill Garfield eats Tweety Dog Cat Bird Garfield hates Boule (a) A fact base F. (b) An ontology H. Fig. 1: A knowledge base KB = (F, H). tid IAnimal Bill {(eats, T weety), (playsW ith, Boule)} T weety {(hates, Bill)} T om {(eats, T weety), (hates, Bill)} Garf ield {(hates, Bill), (eats, T weety), (hates, Boule)} Fig. 2: The transactional database TKB,Animal for the context Animal in the knowledge base KB depicted in Figure 1. this initial choice from the perspective of various data integration tools on the Linked Open Data. One of the main challenges encountered by these tools is the mapping between various class instances. It is then not uncommon to consider the RDF database from a class type at a time. If we consider this point of view then we model the couple “(Predicate,Object)” as an item, and the set of items associated with a given subject an itemset. The itemset corresponding to each distinct subject from F are depicted in Figure 2. One may notice that 50% of the itemsets depicted in Figure 2 include the subset {(hates, Bill), (eats, T weety)}, while 75% include {(hates,Bill)}. But this is simply due to the fact that our knowledge base contains a lot of cats (that hate Bill). Actually, if we look closer, we notice that all cats hate Bill and all birds hate Bill but no dogs hate Bill. By considering this contextual information we could be more fine-tuned with respect to frequent itemsets. 3 Theoretical Foundations The contextual frequent pattern (CFP) mining problem aims at discovering pat- terns whose property of being frequent is context-dependent. This section explains the main concepts behind this notion for Linked Open Data. We consider as input a knowledge base KB = (F, H) composed of an ontol- ogy H (viewed as a directed acyclic graph) and a set of facts F. The set of facts F is defined as the set of RDF triples of the form (subject, predicate, object) . Each element of the triple is defined according to the ontology1 . The ontol- ogy, also denoted the context hierarchy H, is a directed acyclic graph (DAG), denoted by H = (VH , EH ), such that VH is a set of vertices also called contexts and EH ⊆ VH × VH is a set of directed edges among contexts. H is naturally associated with a partial order