Ontology-Driven Association Rule Extraction: A Case Study

Andrea Bellandi (a.bellandi@imtlucca.it)
IMT - Lucca Institute for Advanced Studies, Piazza S. Ponziano 6 - 55100 Lucca, ITALY

Barbara Furletti (b.furletti@imtlucca.it)
IMT - Lucca Institute for Advanced Studies, Piazza S. Ponziano 6 - 55100 Lucca, ITALY

Valerio Grossi (vgrossi@di.unipi.it)
Department of Computer Science, University of Pisa, Largo B. Pontecorvo 3 - 56127 Pisa, ITALY

Andrea Romei (romei@di.unipi.it)
Department of Computer Science, University of Pisa, Largo B. Pontecorvo 3 - 56127 Pisa, ITALY


This paper proposes an integrated framework for extracting Constraint-based Multi-level Association Rules with ontology support. The system permits the definition of a set of domain-specific constraints on a specific domain ontology, and the querying of the ontology to filter the instances used in the association rule mining process. This method can improve the quality of the extracted association rules in terms of relevance and understandability.

Introduction

The results of Data Mining (DM), i.e. the models, represent relations in the data and are usually employed for classifying new data or for describing correlations hidden in the data. In this paper, we focus on Association Rule Mining as originally introduced by Agrawal et al. in [2] and on a way of improving the results of the process. There are several ways to reduce the computational complexity of Association Rule Mining and to increase the quality of the extracted rules: (i) reducing the search space; (ii) exploiting efficient data structures; (iii) adopting domain-specific constraints. The first two classes of optimizations are used for reducing the number of steps of the algorithm, for re-organizing the itemsets, for encoding the items, and for organizing the transactions in order to minimize the time complexity of the algorithm. The third class tries to overcome the lack of user-driven data exploration by handling domain-specific constraints. This paper focuses on the third class of optimizations: it represents a specific domain by means of an ontology and drives the extraction of association rules by expressing constraints on it. The aim of this work is to reduce the "search space" of the algorithm and to improve the significance of the association rules.

Paper Organization. Section 2 provides some notions of OWL ontologies, data mining and association rules. Section 3 introduces the syntax of the constraints and describes the process. Section 4 presents a case study based on a real dataset. Section 5 discusses related work and Section 6 proposes some ideas for further improvements.

OWL Overview

OWL is a family of three ontology languages: OWL-Lite, OWL-DL, and OWL-Full. The first two languages can be considered syntactic variants of the SHIF(D) and SHOIN(D) description logics (DL), respectively, whereas the third language was designed to provide full compatibility with RDF(S). We focus mainly on the first two variants of OWL because OWL-Full has a nonstandard semantics that makes the language undecidable and therefore difficult to implement. OWL comes with several syntaxes, all of which are rather verbose. Hence, in this paper we use the standard DL syntax [3]. The main building blocks of DL knowledge bases are concepts (or classes), representing sets of objects, roles (or properties), representing relationships between objects, and individuals, representing specific objects. OWL ontologies consist of two parts: intensional and extensional. The former part consists of a TBox and an RBox, and contains knowledge about concepts (i.e. classes) and the complex relations between them (i.e. roles). The latter part consists of an ABox, and contains knowledge about entities and how they relate to the classes and roles from the intensional part. In our scenario, the TBox and RBox provide supermarket domain knowledge, while all the supermarket items constitute the ABox, which is interlinked with the intensional knowledge.

The semantics for OWL DL is fairly standard.

Data Mining and Association Rules

Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. The relationships and summaries derived through a data mining exercise are often referred to as models or patterns. The main tasks of data mining are generally divided into two categories: predictive and descriptive. The objective of predictive tasks is to predict the value of a particular attribute based on the values of other attributes, while the objective of descriptive tasks is to derive patterns (correlations, trends, clusters, ...) that summarize the relationships in the data. Association rule mining is one of the major techniques of data mining and it is perhaps the most common form of local-pattern discovery in unsupervised learning systems. These methodologies retrieve all possible interesting patterns in the database. Given a database $D$ of transactions, where each transaction $T \in D$ is a set of items, an association rule is a (statistical) implication of the form $X \rightarrow Y$, where $X$ and $Y$ are sets of items and $X \cap Y = \emptyset$. A rule $X \rightarrow Y$ is said to have a support (or frequency) factor $s$ if and only if at least $s\%$ of the transactions in $D$ contain $X \cup Y$. A rule $X \rightarrow Y$ is satisfied in the set of transactions $D$ with a confidence factor $c$ if and only if at least $c\%$ of the transactions in $D$ that contain $X$ also contain $Y$. The support is a measure of statistical significance, whereas the confidence is a measure of the strength of the rule. A rule is said to be "interesting" if its support and confidence are greater than the user-defined thresholds $sup_{min}$ and $con_{min}$, respectively, and the objective of the mining process is to find all such interesting rules [13].
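The definitions of support and confidence can be made concrete with a minimal sketch. The function names (`support`, `confidence`, `is_interesting`) and the toy transactions below are illustrative assumptions and are not part of the system described in this paper; thresholds are expressed as fractions rather than percentages.

```python
from typing import FrozenSet, Iterable, Sequence


def support(itemset: Iterable[str], transactions: Sequence[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: Sequence[FrozenSet[str]]) -> float:
    """Fraction of the transactions containing X that also contain Y."""
    x_set, y_set = frozenset(x), frozenset(y)
    if x_set & y_set:
        raise ValueError("antecedent and consequent must be disjoint")
    supp_x = support(x_set, transactions)
    if supp_x == 0:
        return 0.0
    return support(x_set | y_set, transactions) / supp_x


def is_interesting(x, y, transactions, sup_min: float, con_min: float) -> bool:
    """A rule X -> Y is kept when both thresholds are reached."""
    rule_support = support(frozenset(x) | frozenset(y), transactions)
    return rule_support >= sup_min and confidence(x, y, transactions) >= con_min


# Hypothetical toy data, only for illustration.
transactions = [
    frozenset({"bread", "wine"}),
    frozenset({"bread", "wine", "ham"}),
    frozenset({"bread", "ham"}),
    frozenset({"wine"}),
]
rule_support = support({"bread", "wine"}, transactions)
rule_confidence = confidence({"bread"}, {"wine"}, transactions)
```

On this toy data the rule bread → wine is contained in two of the four transactions, and two of the three transactions containing bread also contain wine.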

Description of the approach

In this section, we describe our approach for guiding the extraction of Multi-level Constraint-based Association Rules with ontology support. Our scenario consists of the set of components shown in Figure 1. The ontology ($O_D$) describes the domain of interest ($D$) and is used as a means of meta-data representation. The interpretation module translates the requests of a user (user constraints) into a set of formal constraints ($Q_D$, defined on $O_D$) so that they can be supplied to the Ontology Query Engine by means of a suitable query language. The aim of these constraints is to exclude some items from the output association rules, or to characterize interesting items according to an abstraction level.

The user constraints syntax is formalized in Table 1. It includes both pruning constraints, used for filtering out a set of non-interesting items, and abstraction constraints, which permit the generalization of an item to a concept of the ontology. By using pruning constraints, one can specify the exclusion of a set of items from the input transaction set and, as a consequence, from the extracted rules. Constraints of this kind refer either to a single item or to an ontology concept, and they can include a condition expressed on a set of ontology properties. Abstraction constraints permit exploring different levels of the ontology concepts. The generalization to a predefined level of the hierarchy improves the support of association rules, and consequently avoids the discovery of a massive quantity of useless rules, especially in the case of sparse data.

The notation used in Table 1 is the following. $I$ is the set of items ($i_1, i_2, \ldots, i_n \in I$). $C$ is the set of the concepts of the ontology ($c_1, c_2, \ldots, c_n \in C$). $P_c$ is the set of the properties of the concept $c \in C$ ($p_1, p_2, \ldots, p_n \in P_c$). $cond_c$ is a Description Logic expression. ALL represents all the instances defined in the ontology. A constraint is defined on $I$, $C$ and $P_c$ in one of the forms listed in Table 1.

The ontology query engine interacts with the ontology by performing the set $Q_D$ of queries. The resulting instance set $R_D$ is used by the DB query engine for retrieving the transactions that contain the filtered/abstracted/pruned items (i.e., the items specified in $R_D$). The database is the repository of the data to be passed as input to the data mining tools. The box "Data Mining Tools" contains the tool for analyzing and processing the data. In our context we refer to a specific algorithm for extracting association rules, but we would like to point out that the system can operate with other kinds of DM tools. The support and confidence thresholds are initially provided by the user.
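The step in which the result of the ontology queries is applied to the transactions can be sketched as follows. This is only an illustration under stated assumptions: the result set is modeled as a plain dictionary from item to replacement concept (with `None` meaning "pruned"), transactions that become empty are dropped, and the function name `apply_constraints` and all item names are hypothetical rather than taken from the actual implementation.

```python
from typing import Dict, Iterable, List, Optional, Set


def apply_constraints(transactions: Iterable[Set[str]],
                      mapping: Dict[str, Optional[str]]) -> List[Set[str]]:
    """Rewrite each transaction according to `mapping`.

    - an item mapped to a concept is replaced by that concept (abstraction);
    - an item mapped to None is removed (pruning);
    - an item absent from the mapping is kept unchanged.
    Transactions left empty after pruning are discarded (an assumption).
    """
    rewritten: List[Set[str]] = []
    for transaction in transactions:
        new_items: Set[str] = set()
        for item in transaction:
            target = mapping.get(item, item)
            if target is not None:
                new_items.add(target)
        if new_items:
            rewritten.append(new_items)
    return rewritten


# Hypothetical result of the ontology queries for a handful of items.
mapping = {
    "Vodka Keglevich Melon": "Drinks",
    "Panettone": "FoodStuffs",
    "Pandoro": "FoodStuffs",
    "plastic bag": None,
}
transactions = [
    {"Vodka Keglevich Melon", "Panettone", "plastic bag"},
    {"plastic bag"},
    {"Pandoro", "Vodka Keglevich Melon"},
]
mined_input = apply_constraints(transactions, mapping)
```

The rewritten transactions, rather than the raw ones, are what a mining algorithm such as Apriori would receive; keeping the rewriting outside the miner is what allows other DM tools to be plugged in.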

Case study: a Market Basket Analysis application

In this section we show the results of a case study that uses data taken from a national supermarket and stored in a relational database (DB). The aim of this study is to construct and test the framework described in the previous section with real data and with respect to specific market analyses. In this case, the data consist of a set of purchase transactions T = [transID, item], where transID is the cash voucher identification and item is the purchased item. The DB contains 775,000 transactions. According to the approach proposed in Section 3, the meta-data (description of the items) and the data to analyze have been organized in two separate structures:

The ontology - contains the description of the items and their hierarchical organization. Starting from the DB structure (tables and fields; see footnote 3), we derived the OWL ontology schema by mapping the fields of the DB tables to classes and properties of the ontology. We also automatically populated the ontology with about 30,000 items and their attributes (approximately 100).

Let us consider the item Vodka Keglevich Melon; its hierarchical structure and attributes are tabulated at the end of the paper.

The DB - contains the transactions T.

The experimentation has been conducted using the SeRQL ("Sesame RDF Query Language") [4] language for querying the ontology and the Apriori algorithm [1] for mining association rules. SeRQL is an RDF/RDFS query language that is currently being developed by Aduna as part of Sesame [5]. It combines the best features of other (query) languages (RQL, RDQL, N-Triples, N3) and adds some of its own. Sesame is an RDF database which can be employed to manage RDF triples.

In the first two tests we abstract all items to two upper levels (level L2 and level L1) to verify which categories of items are bought together. In this way we abstract all items to only 14 high-level concepts in the first case and to only 4 high-level concepts in the second one. These abstraction constraints can be expressed respectively as:

Query 1 ≡ $\mathit{abstract}^{2}(ALL)$

Query 2 ≡ $\mathit{abstract}^{1}(ALL)$
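The effect of a level-based abstraction applied to ALL items can be illustrated with a small sketch. The taxonomy below is a hypothetical fragment loosely inspired by the Vodka Keglevich Melon example; the level numbering (root of the fragment at level 0), the helper names and the extra items are assumptions made only for this illustration.

```python
from typing import Dict, List, Set

# Hypothetical child -> parent links of a tiny taxonomy fragment.
PARENT: Dict[str, str] = {
    "Drinks": "Foodstuffs and Drinks Department",
    "Sweets": "Foodstuffs and Drinks Department",
    "Vodka": "Drinks",
    "Wine": "Drinks",
    "Vodka Keglevich Melon": "Vodka",
    "Chianti 75 cl": "Wine",
    "Panettone": "Sweets",
}


def path_to_root(node: str) -> List[str]:
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def abstract_to_level(node: str, level: int) -> str:
    """Replace `node` by its ancestor at depth `level` (root = depth 0).

    Nodes that already sit at or above the requested level are unchanged.
    """
    path = path_to_root(node)
    depth = len(path) - 1
    if depth <= level:
        return node
    return path[depth - level]


def abstract_all(transactions: List[Set[str]], level: int) -> List[Set[str]]:
    """Apply the abstraction to every item of every transaction."""
    return [{abstract_to_level(item, level) for item in t} for t in transactions]


basket = [{"Vodka Keglevich Melon", "Chianti 75 cl", "Panettone"}]
level_1 = abstract_all(basket, 1)
level_2 = abstract_all(basket, 2)
```

Because each transaction is a set, items that share the same ancestor collapse into a single concept, which is why higher abstraction levels yield fewer distinct items per transaction.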

The third test concerns an investigation for organizing a future promotional campaign during the holidays (Christmas and Easter). The focus is on typical sweets and cakes (with well-known brands) of the two holidays, and the alcoholic drinks. The objective is to verify how those articles are related. All kinds of sweets/cakes are abstracted to Foodstuffs (associated with the item brand) and all kinds of alcoholic drinks to Drinks. These constraints can be expressed as:

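Query 3 ≡ $\mathit{prune}_{(\exists \mathit{hasBrand}.= \mathit{null})}(ALL) \wedge \mathit{abstract}_{(\exists \mathit{hasBrand}.\neq \mathit{null})}(\mathit{Alcoholic}, \mathit{Drinks}) \wedge \mathit{abstract}_{(\exists \mathit{hasBrand}.\neq \mathit{null})}(\mathit{Sweets}, \mathit{FoodStuffs}) \wedge \mathit{abstract}_{((\exists \mathit{hasRecurrence}.= \mathit{Easter}) \sqcup (\exists \mathit{hasRecurrence}.= \mathit{Christmas}))}(\mathit{Sweets}, \mathit{FoodStuffs})$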

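The part of the ontology schema (i.e. the part of the DL knowledge base) related to Query 3 can be expressed by a TBox fragment. According to the interpretation $\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$ defined in Section 2, the conditions expressed by the abstract clauses are interpreted through the concept expression

$$\mathit{Alcoholic} \sqcap (\exists \mathit{hasBrand}.\neq \mathit{null}) \;\sqcup\; \mathit{Sweets} \sqcap (\exists \mathit{hasBrand}.\neq \mathit{null}) \sqcap \big((\exists \mathit{hasRecurrence}.= \mathit{Easter}) \sqcup (\exists \mathit{hasRecurrence}.= \mathit{Christmas})\big)$$

In its semantic interpretation, $\{A\}$ and $\{S\}$ are the instance sets of the classes Alcoholic and Sweets respectively, with $a \in \{A\}$ and $s, p, q \in \{S\}$; $b_a$ and $b_s$ are any well-known brands of Alcoholic and Sweets respectively. The semantics of the prune clause is very similar to that of abstract, so we omit it for lack of space.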

In the last test, we consider the case in which the supermarket augments its services by introducing a new department (Assisted Service). This event introduces an innovation in the supermarket domain, so we have to modify the ontology (see footnote 4), i.e. we have to introduce a new data property for some categories (typeOfService (ToS), with enumerated type Assisted Service, Take Away, Free Service). We abstract to level L2 all the items with typeOfService equal to Assisted Service or Take Away, ignoring the others. This constraint can be expressed as:

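Query 4 ≡ $\mathit{abstract}^{2}_{(\exists \mathit{hasToS}.= \mathit{AssistedService})}(ALL) \wedge \mathit{abstract}^{2}_{(\exists \mathit{hasToS}.= \mathit{TakeAway})}(ALL) \wedge \mathit{prune}_{((\exists \mathit{hasToS}.\neq \mathit{AssistedService}) \sqcap (\exists \mathit{hasToS}.\neq \mathit{TakeAway}))}(ALL)$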

For lack of space, we omit the semantic interpretation of Query 4.

To evaluate our framework, we submitted the queries introduced above to the system. Our framework automatically translates these constraints into the SeRQL language for querying the ontology. In all tests we applied the Apriori implementation of the KDDML system [10], setting the support threshold to 1% (see footnote 5) and the confidence threshold to 50%. In Table 2, the five rows represent the results of the tests. The first query, labeled no constraints, represents the request without any constraints. #Trans reports the number of transactions that satisfy the constraints, #Items reports the total number of different articles that compose the transactions, and #Itemsets and #Rules report the number of itemsets and rules computed by Apriori, respectively. Furthermore, LI and AI report the number of items contained in the largest transaction and the average number of items contained in a transaction, respectively.

In Figure 2 we report the support graph of the queries. The abscissa shows the top 50 frequent itemsets, while the ordinate shows the support of the i-th frequent itemset. The result of Test 2 is not reported in the figure because it contains only 15 frequent itemsets.

The use of real data typically brings issues related to the quality of the extracted model. Items at the lower levels of the taxonomy may not have enough support to appear in any frequent itemset. This aspect is highlighted in Figure 2, where Query 0 retrieves only itemsets with a very low support. This is mainly due to the large number of articles. Moreover, rules extracted at the lower levels of a concept are too specific and may not be interesting. For example, a rule extracted at the low level is not relevant because of its low support.

Consider instead the following rule, which corresponds to such a low-level rule but at a higher level of abstraction:

{bread, red wine, ham, chocolate cake} ⇒ {roasted chicken, cooked lasagne} [supp = 0.02, conf = 0.57]

This rule abstracts all the items to level L2 of the ontology, and each of them is selected by the typeOfService property. The information extracted from this association rule suggests that the assisted service department should also provide customers with (take away) cooked meals (roasted chicken, cooked lasagne).

In general, items abstracted at the higher levels tend to have higher support counts. This fact increases the quality of the extracted rules and, as a consequence, helps the analyst in decision support. Association rules related to Query 3, for example, emphasize the concept of multi-level rule by correlating concepts at different abstraction levels. For example, the concept FoodStuffs (level L2) with BAULI and MOTTA as brands, and Drinks (level L2) with ASTI (see footnote 6), are related to Red Meats (level L7) slaughtered and packed by the supermarket. This can suggest to the analyst some marketing decisions on these products during the Easter or Christmas period.
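The observation that abstracted items tend to have higher support can be reproduced on a toy example. The data, the item-to-concept mapping `CONCEPT_OF` and the helper `pair_supports` below are hypothetical and serve only to illustrate the effect; they are not taken from the supermarket dataset.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple


def pair_supports(transactions: List[Set[str]]) -> Dict[Tuple[str, str], float]:
    """Support of every 2-itemset that occurs in at least one transaction."""
    counts: Counter = Counter()
    for transaction in transactions:
        for pair in combinations(sorted(transaction), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical mapping from low-level items to higher-level concepts.
CONCEPT_OF = {
    "Chianti 75 cl": "Wine",
    "Barolo 75 cl": "Wine",
    "Panettone": "Sweets",
    "Pandoro": "Sweets",
}

low_level = [
    {"Chianti 75 cl", "Panettone"},
    {"Barolo 75 cl", "Pandoro"},
    {"Chianti 75 cl", "Pandoro"},
    {"Barolo 75 cl", "Panettone"},
]
high_level = [{CONCEPT_OF[item] for item in t} for t in low_level]

low = pair_supports(low_level)
high = pair_supports(high_level)
```

At the item level every pair occurs in only one of the four transactions, whereas after abstraction the single pair (Sweets, Wine) occurs in all of them, so it would survive a much higher support threshold.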

The study of multi-level association rules is well known in the literature and, in this context, our work may not seem innovative. The focus of our approach is the introduction of the expressive power of ontologies for constraint-based multi-level association rule mining. The main advantages can be summarized in terms of extensibility and flexibility. Our framework is extensible because data properties and concepts can be introduced in the ontology without changing either the relational database containing the transactions or the implementation of our framework. The flexibility is guaranteed by the separation of the data to analyze (the transactions) from the meta-data (the description of the data). Furthermore, it is interesting to point out that our approach is general and can be adapted to further data mining analyses.

Related Works

Methods to define and integrate item constraints were originally introduced by Srikant and Agrawal in [11] and by Han and Fu in [7]. More recently, [12] and [9] attempt to integrate the evaluation of item constraints directly into the rule extraction algorithm. In [12], the authors concentrate on improving the Apriori algorithm, while in [9] the authors focus on the definition of a two-phase approach: specification of the constraint association queries, and submission of the constraints to the mining process.

Our approach follows the research line proposed by the cited works. Nevertheless, it introduces three main differences: (i) we employ an ontology to represent an item taxonomy; (ii) constraints can be defined on the basis of specific properties of the items; (iii) by using an ontology instead of a taxonomy, a new item property or a concept can be added without re-engineering the (meta-data) representation model or the relational database.

Other studies concern the merging of association rule mining with a domain ontology. In [6], the authors use an ontology to improve support counting during the association rule mining phase by using a taxonomy. Another interesting approach is presented in [8], where an ontology-based algorithm is employed for discovering rules of product fault causes, in an attempt to discover clearer, high-level rules. In this case, the system enables the user only to specify an ideal level of generality of the extracted rules. In addition to this, our framework also enables the users to specify different levels of abstraction for different items, depending on the specific properties of such items. A concise syntax has been defined to this aim. In our view, the use of an ontology strengthens the definition of constraints, enabling us to use data properties in domain-specific constraints.

Conclusions and future works

We proposed an integrated framework for the extraction of constraint-based multi-level association rules with the aid of an ontology. Our system permits the definition of domain-specific constraints by using the ontology for filtering the instances used in the association rule mining process. The main advantages of the proposed framework can be summarized in terms of extensibility and flexibility.

In our case study, the supermarket domain is modeled only by classes and data properties, and it would be very interesting to study: (i) how object properties (and more complex logical relationships) can be employed in our framework; (ii) what aspects they can improve. Another important direction for future work is the possibility of modeling the antecedent and the consequent of an association rule as ontology concepts, in order to express constraints on the structure of the association rules. Furthermore, we could improve the system by integrating the evaluation of the constraints directly into the mining algorithm.

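An interpretation $\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$ is a tuple where $\Delta^{\mathcal{I}}$, the domain of discourse, is the union of two disjoint sets $\Delta^{\mathcal{I}}_{O}$ (the object domain) and $\Delta^{\mathcal{I}}_{D}$ (the data domain), and $\cdot^{\mathcal{I}}$ is the interpretation function that gives meaning to the entities defined in the ontology. $\cdot^{\mathcal{I}}$ maps each OWL class $C$ to a subset $C^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}_{O}$, each object property $P_{Obj}$ to a binary relation $P^{\mathcal{I}}_{Obj} \subseteq \Delta^{\mathcal{I}}_{O} \times \Delta^{\mathcal{I}}_{O}$, and each datatype property $P_{Data}$ to a binary relation $P^{\mathcal{I}}_{Data} \subseteq \Delta^{\mathcal{I}}_{O} \times \Delta^{\mathcal{I}}_{D}$. The whole definition is in the OWL W3C Recommendation (http://www.w3.org/TR/owl-semantics/).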

Fig. 1. The system architecture.

1. Pruning Constraints. A pruning constraint is of one of the following forms:
   (a) $\mathit{prune}(e)$, where $e \in I \cup C \cup \{ALL\}$.
   (b) $\mathit{prune}_{cond_c}(c)$, where $c \in C \cup \{ALL\}$.
2. Abstraction Constraints. An abstraction constraint is of one of the following forms:
   (a) $\mathit{abstract}(e, c)$, where $e \in I \cup C$, $c \in C$ and $c$ is a super-concept of $e$.
   (b) $\mathit{abstract}_{cond_{c_1}}(c_1, c_2)$, where $c_1 \in C \cup \{ALL\}$, $c_2 \in C$ and $c_2$ is a super-concept of $c_1$.
   (c) $\mathit{abstract}^{l}_{cond_e}(e)$, where $e \in I \cup C \cup \{ALL\}$, and $l$ is a non-negative integer indicating the level of the hierarchy; $cond$ can be unspecified.

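$$
\begin{aligned}
&\Big(\mathit{Alcoholic} \sqcap (\exists \mathit{hasBrand}.\neq \mathit{null}) \;\sqcup\; \mathit{Sweets} \sqcap (\exists \mathit{hasBrand}.\neq \mathit{null}) \sqcap \big((\exists \mathit{hasRecurrence}.= \mathit{Easter}) \sqcup (\exists \mathit{hasRecurrence}.= \mathit{Christmas})\big)\Big)^{\mathcal{I}} \\
&= \mathit{Alcoholic}^{\mathcal{I}} \cap \{x_a \mid \exists y_a.(x_a, y_a) \in \mathit{hasBrand}^{\mathcal{I}} \wedge y_a \neq \mathit{null}^{\mathcal{I}}\} \\
&\quad \cup\; \mathit{Sweets}^{\mathcal{I}} \cap \{x_s \mid \exists y_s.(x_s, y_s) \in \mathit{hasBrand}^{\mathcal{I}} \wedge y_s \neq \mathit{null}^{\mathcal{I}}\} \\
&\quad \cap\; \big(\{z \mid \exists w.(z, w) \in \mathit{hasRecurrence}^{\mathcal{I}} \wedge w = \mathit{Easter}^{\mathcal{I}}\} \cup \{h \mid \exists k.(h, k) \in \mathit{hasRecurrence}^{\mathcal{I}} \wedge k = \mathit{Christmas}^{\mathcal{I}}\}\big) \\
&= \{A\} \cap \{x_a \mid \exists y_a.(x_a, y_a) \in \{(a, b_a)\} \wedge y_a \neq \mathit{null}\} \\
&\quad \cup\; \{S\} \cap \{x_s \mid \exists y_s.(x_s, y_s) \in \{(s, b_s)\} \wedge y_s \neq \mathit{null}\} \\
&\quad \cap\; \big(\{z_s \mid \exists w_s.(z_s, w_s) \in \{(p, r_p)\} \wedge w_s = \mathit{Easter}\} \cup \{h_s \mid \exists k_s.(h_s, k_s) \in \{(q, r_q)\} \wedge k_s = \mathit{Christmas}\}\big) \\
&= \{A\} \cap \{(a, \mathit{brand})\} \;\cup\; \{S\} \cap \{(s, \mathit{brand})\} \cap \big(\{(p, \mathit{Easter})\} \cup \{(q, \mathit{Christmas})\}\big) \\
&= \{(\mathit{alcoholic}, b_a)\} \cup \{(\mathit{sweets}_{Easter}, b_s)\} \cup \{(\mathit{sweets}_{Christmas}, b_s)\}
\end{aligned}
$$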

{bread, red wine, ham, chocolate cake} ⇒ {roasted chicken, cooked lasagne} [supp = 0.02, conf = 0.57].

Fig. 2. Compared Supports.

Table 1. User constraints syntax.

The corresponding hierarchical structure and the list of the item attributes are shown in the table below.

Hierarchical Structure | Attributes of Vodka
owl:Things | hasColour: transparent;
∇ XXX Supermarket | hasAlcoholicContent: high;
∇ L0 Foodstuffs and Drinks Department | hasFlavour: Melon;
∇ L1 Drinks | hasBrand: Keglevich;
∇ L2 Vodka | isFizzy: No;
∇ L3 Spicy | hasPrice: EUR 7.56;
Vodka Keglevich Melon | hasSize: 70 cl;

Table 2. Queries summary results.
Columns: ID, #Trans, #Items, #Itemsets, #Rules, LI, AI
no constraints, Query0: 915631231765076 7.68
Test 1, Query1: 8076511524124812 3.86
Test 2, Query2: 76323415314 2.83
Test 3, Query3: 352336096 2.17
Test 4, Query4: 695341020025810 4.03

Footnotes

3. We considered the DB table named Marketing that, for each article, specifies a hierarchical structure w.r.t. the department organization in the supermarket.
4. Notice that the introduction of a new property does not imply the re-engineering of the structure, but only the introduction of the property in the higher classes, so that the property is inherited by each subclass.
5. This low support threshold is due to the large number of items.
6. MOTTA, BAULI and ASTI are Italian food and drink brands.

Acknowledgement. This work is supported by the MUSING project (www.musing.eu/).

References

[1] R. Agrawal, M. Mehta, J. Shafer, R. Srikant. Fast algorithms for mining association rules in large databases. Proceedings of the 20th International Conference on Very Large Databases (VLDB '94), Santiago de Chile, Chile.
[2] R. Agrawal, R. Srikant, A. Swami. Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD '93), San Diego, CA.
[3] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, P. Patel-Schneider. The Description Logic Handbook. Cambridge University Press, 2003.
[4] J. Broekstra, A. Kampman. SeRQL: An RDF Query and Transformation Language. 2004.
[5] J. Broekstra, A. Kampman, F. van Harmelen. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In I. Horrocks, J. Hendler (eds.), Proceedings of the first International Semantic Web Conference (ISWC 2002), Sardinia, Italy, June 9-12, 2002. Lecture Notes in Computer Science 2342, pp. 54-68. Springer Verlag, Heidelberg, Germany.
[6] X. Chen, X. Zhou, R. Scher, J. Geller. Using an interest Ontology for Improved Support in Rule Mining. Proceedings of the 5th International Conference of Data Warehousing and Knowledge Discovery (DaWaK 2003), Prague, Czech Republic.
[7] J. Han, Y. Fu. Discovery of multiple-level association rules from large databases. Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95), San Francisco, CA.
[8] X. Hou, J. Gu, X. Shen, W. Yan. Application of Data Mining in Fault Diagnosis Based on Ontology. Proceedings of the 3rd Conference on Information Technology and Applications (ICITA '05), Sydney, Australia.
[9] R. T. Ng, L. V. S. Lakshmanan, J. Han, A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, WA.
[10] A. Romei, S. Ruggieri, F. Turini. KDDML: a middleware language and system for knowledge discovery in databases. Data and Knowledge Engineering 57(2), 2006.
[11] R. Srikant, R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95), San Francisco, CA.
[12] R. Srikant, Q. Vu, R. Agrawal. Mining Association Rules with Item Constraints. Proceedings of the 3rd International Conference of Knowledge Discovery and Data Mining (KDD '97), Newport Beach, CA.
[13] P. N. Tan, M. Steinbach, V. Kumar. Introduction to Data Mining. Pearson International Edition - Addison Wesley, 2006.