=Paper=
{{Paper
|id=None
|storemode=property
|title=FCA&A 2015 Proceedings of the International Workshop on Formal Concept Analysis and Applications
|pdfUrl=https://ceur-ws.org/Vol-1434/proceedings-fcaa.pdf
|volume=Vol-1434
}}
==FCA&A 2015: Proceedings of the International Workshop on Formal Concept Analysis and Applications==
FCA&A 2015. Proceedings of the International Workshop on Formal Concept Analysis and Applications (co-located with ICFCA 2015). Universidad de Málaga (Dept. Matemática Aplicada), Spain. ISBN 978-84-606-7410-8.

The 13th International Conference on Formal Concept Analysis. Formal Concept Analysis and Applications, FCA&A 2015, Nerja (Málaga), Spain, June 23–26, 2015 (an ICFCA workshop). Edited by Manuel Ojeda-Aciego, Jaume Baixeries, and Christian Sacarea.

© Manuel Ojeda-Aciego, Jaume Baixeries, Christian Sacarea, Editors. This work is subject to copyright. All rights reserved. Reproduction or publication of this material, even in part, is allowed only with the authors' and editors' permission. Technical Editor: Manuel Ojeda-Aciego, aciego@uma.es

Table of Contents

Preface
Depth-first Search for Pseudo-intents through Minimal Transversals (Alexandre Bazin)
Edit Operations on Lattices for MDL-based Pattern Summarization (Keisuke Otaki and Akihiro Yamamoto)
An Approach Towards Classifying and Navigating RDF data based on Pattern Structures (Mehwish Alam and Amedeo Napoli)
Subdirect Decomposition of Contexts into Subdirectly Irreducible Factors (Jean-François Viaud, Karell Bertet, Christophe Demko, and Rokia Missaoui)
Context-Dependent Deductive and Inductive Reasoning Based on Good Diagnostic Tests (Xenia Naidenova and Vladimir Parkhomenko)
Concept Lattices of RDF Graphs (Jens Kötters)
Variability representation in product lines using concept lattices: feasibility study with descriptions from Wikipedia's product comparison matrices (Jessie Carbonnel, Marianne Huchard, and Alain Gutierrez)

Preface

Formal Concept Analysis (FCA) is a mathematical field rooted in lattice and order theory which, despite its theoretical nature, has proved to be of interest to various applied fields such as Knowledge Discovery and Data Mining, Database Theory, Data Visualization, and many others. The 13th International Conference on Formal Concept Analysis (ICFCA) took place from the 23rd to the 26th of June, 2015 in Nerja (Málaga), Spain, organized by members of the Universidad de Málaga. The International Workshop on Formal Concept Analysis and Applications (FCA&A) was organized in conjunction with ICFCA and consisted of the seven works included in this volume.

Our deepest gratitude goes to all the authors of submitted papers, the Program Committee members, and the external reviewers. Working with the efficient and capable team of local organizers was a constant pleasure. We are deeply indebted to all of them for making this conference a successful forum on FCA. Last, but not least, we are most grateful to the organizations that sponsored this event: the Universidad de Málaga, Andalucía Tech (International Campus of Excellence), the Patronato de Turismo de la Costa del Sol and the Área de Turismo del Ayuntamiento de Nerja, all in Spain. Finally, we would like to emphasize the great help of EasyChair in making the technical duties easier.
June 2015 Manuel Ojeda-Aciego Jaume Baixeries Christian Sacarea Depth-first Search for Pseudo-intents through Minimal Transversals Alexandre Bazin contact@alexandrebazin.com Abstract. The enumeration of pseudo-intents is a long-standing prob- lem in which the order plays a major role. In this paper, we present new algorithms that enumerate pseudo-intents in orders that do not necessar- ily respect the inclusion relation. We show that, confirming established results, this enumeration is equivalent to computing minimal transversals in hypergraphs a bounded number of times. 1 Introduction Formal concept analysis, as a field of applied mathematics, is interested in the patterns that can be found in data taking the form of a set of objects described by attributes. Among these patterns are the implications, relations between at- tribute sets A and B representing the fact that every object described by A is also described by B. As often with this type of relation, interest revolves around a small, non redundant set of implications. Such a set, the Duquenne-Guigues basis, is constructed using the notion of pseudo-intent. Indeed, computing the Duquenne-Guigues basis, and thus all the implications in a data set, is equiva- lent to enumerating all the pseudo-intents. In [15], it is shown that the number of pseudo-intents can be exponential in the number of attributes. It has been proven in [6, 2] that pseudo-intents cannot be enumerated with a polynomial delay in lectic or reverse lectic order. However, to the best of our knowledge, no such result exists for other orders. The best-known algorithm, Next Closure [11], enumerates only in the lectic order. In a previous work [4], we proposed an exponential delay algorithm to enumerate in any order that extends the inclu- sion order. In order to continue the study of the influence of the order on the complexity of this enumeration problem, we propose here two new algorithms that, together, can enumerate pseudo-intents without respecting the inclusion. Our algorithms enumerate pseudo-intents incrementally, i.e. given a formal context and a set of previously found implications, they return a new pseudo- intent. In order to do this, we define new conditions under which a pseudo-intent P can be recognized using the lower covers of P in the lattice of attribute sets closed under the implications already found. This allows us to build algorithms that, instead of starting from ∅ and adding attributes to construct sets, start from > and remove attributes, resulting in a depth-first search in the lattice. We show that both constructing and recognizing a pseudo-intent can be done c 2015 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Local Proceedings also appeared in ISBN 978-84-606-7410-8, Universidad de Málaga (Dept. Matemática Aplicada), Spain. 2 Alexandre Bazin by computing a bounded number of times the minimal transversals of some hypergraph, which makes it easy to bound the delay and isolate special, easier cases. This result establishes a link with the proof in [6] that enumerating pseudo- intents without order is at least as hard as computing minimal transversals. We start by recalling in Section 2 the basic definitions of FCA and results on the enumeration of pseudo-intents and minimal transversals. In Section 3, we present the definition of a recognizable pseudo-intent that we use along with the properties needed for our algorithms. 
Finally, Sections 4 and 5 are dedicated to the two algorithms, an example and a brief study of their delay.

2 Preliminaries

2.1 Basics

In formal concept analysis, data takes the form of a formal context. A formal context is a triple C = (O, A, R) in which O is a set of objects, A is a set of attributes and R ⊆ O × A is a relation between objects and attributes. An object o ∈ O is said to be described by an attribute a ∈ A if (o, a) ∈ R. Together with the context come two derivation operators, both written (·)′ and defined as

(·)′ : 2^O → 2^A with O′ = {a ∈ A | ∀o ∈ O, (o, a) ∈ R}
(·)′ : 2^A → 2^O with A′ = {o ∈ O | ∀a ∈ A, (o, a) ∈ R}

Their composition (·)″ is a closure operator. A set of attributes A = A″ closed under (·)″ is called an intent.

2.2 Implications

An implication is a relation between two attribute sets A and B, noted A → B. It is said to hold in the context if and only if A′ ⊆ B′ or, in other words, if and only if every object described by A is also described by B. A set I of implications is a basis if and only if every implication that holds in the context can be derived from I using Armstrong's rules: if B ⊆ A then A → B; if A → B then A ∪ C → B ∪ C; if A → B and B → C then A → C.

A pseudo-intent (or pseudo-closed set) P is a set of attributes that contains the closure of all its subsets and is not an intent. The set B = {P → P″ | P is a pseudo-intent} is the implication basis with minimum cardinality and is called the Duquenne-Guigues basis [12]. As such, computing the Duquenne-Guigues basis, and thus all the implications, amounts to enumerating all the pseudo-intents. As one of the great problems in FCA, it has been abundantly studied [12, 15, 6, 17, 4, 3, 16, 1, 19]. As the number of pseudo-intents can be exponential in the size of the context, which rules out a polynomial-time algorithm, attention focuses instead on the delay, i.e. the time between two consecutive pseudo-intents, and between the last pseudo-intent and the end of the algorithm. It has been shown that pseudo-intents cannot be enumerated with a polynomial delay in lectic [6] or reverse lectic order [2]. For the enumeration without order, it is shown in [6] that the problem is at least as hard as computing the minimal transversals of a hypergraph; the upper bound is not yet known. The problem of recognizing a pseudo-intent has been studied in [1] and found to be coNP-complete.

Fig. 1. A formal context

2.3 Logical Closures

A set of implications I gives rise to a logical closure operator I(·) defined as

A⁺ = A ∪ ⋃{C | B → C ∈ I and B ⊆ A},  I(A) = A⁺⁺···⁺ (iterated up to saturation).

Similarly, we have the logical pseudo-closure operator I⁻(·) defined as

A° = A ∪ ⋃{C | B → C ∈ I and B ⊂ A},  I⁻(A) = A°°···° (iterated up to saturation).

It is known that intents are closed under both B⁻(·) and B(·), whereas a pseudo-intent P is closed under B⁻(·) and under I(·) for any I ⊆ B \ {P → P″}. An attribute set Y is said to be a generator of an attribute set X under I(·) (resp. I⁻(·)) if I(Y) = X (resp. I⁻(Y) = X). It is a minimal generator if there is no generator Z of X such that Z ⊂ Y. We will use GenI(X) to denote the set of all minimal generators of a set X under I(·). The complexity of computing the minimal generators of a set has been studied in [14]: their number can be exponential in the number of attributes, and they can be enumerated with an incremental polynomial delay.

Fig. 2. ΦI with I = {de → bde, be → bde}

We shall use ΦI to denote the lattice of attribute sets closed under I(·), ordered by the inclusion relation.
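To make these operators concrete, here is a minimal Python sketch of the derivation operators and of the logical (pseudo-)closure; the toy context and the names (CONTEXT, closure, logical_closure) are our own and not taken from the paper.

```python
# A hedged sketch of the Section 2 operators on a made-up toy context.
CONTEXT = {                 # object -> set of attributes describing it
    1: {"a", "b"},
    2: {"b", "c", "d"},
    3: {"a", "c", "d", "e"},
}
ATTRIBUTES = set().union(*CONTEXT.values())

def closure(attrs):
    """(.)'': attributes shared by every object described by attrs."""
    rows = [row for row in CONTEXT.values() if attrs <= row]
    return set.intersection(*rows) if rows else set(ATTRIBUTES)

def logical_closure(attrs, implications, strict=False):
    """I(A) (strict=False) or the pseudo-closure I^-(A) (strict=True):
    saturate A with B -> C whenever B is a (proper) subset of the current set."""
    current = set(attrs)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in implications:
            applies = premise < current if strict else premise <= current
            if applies and not conclusion <= current:
                current |= conclusion
                changed = True
    return current
```

With I = {de → bde}, for instance, logical_closure({'d','e'}, [({'d','e'}, {'b','d','e'})]) returns {'b','d','e'}, while the strict pseudo-closure leaves {d, e} unchanged.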
For an attribute set A, the set of pseudo- intents P in I such that P ⊆ A and P 00 6⊆ A will be denoted by PI (A). 2.4 Minimal Transversals The problem of computing minimal transversals in hypergraphs or, equivalently, the prime conjunctive normal form of the dual of a Boolean function (otherwise called monotone dualization), have been largely studied in the literature [13, 10, 8, 9, 7]. Given a hypergraph H = (E, V) where E is a set of edges and V ⊆ 2E is a set of vertices, a transversal T ⊆ E is a set of edges such that ∀V ∈ V, T ∩ V 6= ∅. A transversal is minimal if none of its subsets is a transversal. We shall use T r(V ) to denote the set of minimal transversals of a set of vertices V . For example, T r({ace, bce, cde}) = {c, e, abd}. The problem of computing all the minimal transversals of a hypergraph can be solved in quasi-polynomial total time [9, 10, 13]. It is currently not known Depth-first Search for Pseudo-intents through Minimal Transversals 5 whether it is possible to enumerate the transversals with a polynomial delay in the general case. However, many special cases are known to have an output- polynomial delay. Their references can be found in [9]. In this work, we cannot presuppose the nature of the hypergraph and so we are interested in the general case. The problem still being actively studied, we will suppose that we have a blackbox algorithm nextTrans that enumerates the minimal transversals of a set of vertices X. We suppose, for ease of writing, that the transversals Ti this algorithm enumerates in some arbitrary order T1 < T2 < ... < Tn with nextTrans(Ti , X) = Ti+1 and nextTrans(Tn , X) = ∅. For example, Berge Multiplication [5] can be adapted to take the role of nextTrans. It is important to note that a first transversal can be computed in polynomial time by simply removing attributes until the set stops being a transversal. 3 Recognizing a Pseudo-Intent We know that an attribute set is a pseudo-intent if and only if it contains the closure of all its subsets that are pseudo-intents. As for the problem of recognizing pseudo-intents, two cases have been studied. When only a context and a set A are provided, checking whether A is a pseudo-intent is coNP-complete [1]. An algorithm has been proposed in [19] that runs in O(2|A| ). When the context and A are provided alongside all the pseudo-intents (and thus implications) contained in A, checking whether A is a pseudo-intent is easier as we only have to verify the closedness of A for the logical closure and ·00 , which implies a runtime polynomial in the size of the implication set. It is this method that is used in algorithms such as Next Closure or the one we proposed in [4] and its drawback is that it forces an enumeration in an order that extends the inclusion order. In this paper, we are interested in enumerating pseudo-intents in orders that do not extend the inclusion and, as such, we cannot suppose that we know the closure of all the subsets of a set before attempting to recognize its pseudo- closedness. Hence, we propose to consider the problem of recognizing a pseudo- intent A given a context and a set of subsets of A that is not necessarily complete. Proposition 1. An attribute set A ∈ ΦI is a pseudo-intent if A 6= A00 and all its lower covers in ΦI are closed. Proof. Let us suppose that A is not closed and that all its lower covers are closed. Let B be a lower cover of A and X be a subset of A. If B ⊂ X, then I(X) = A and X 00 = A00 . 
If X ⊆ B and X 00 6⊆ B, then B cannot be closed and we have a contradiction. Therefore, X 00 ⊆ B. Since the closure of every subset of A is either A00 or contained in A, the set A is a pseudo-intent. This proposition states that some pseudo-intents can be recognized in ΦI only by looking at their lower covers. As such, when I is a subset of the Duquenne- Guigues basis, a new pseudo-intent can be found by looking in ΦI for sets that only have intents as lower covers. 6 Alexandre Bazin Definition 1. A pseudo-intent that can be recognized using Proposition 1 with a set of implication I is said to be recognizable. The set of all the implications that are recognizable with I is denoted by Rec(I). Proposition 2. ∀I ⊂ B, Rec(I) 6= ∅ Proof. Let P be minimal among pseudo-intents that are not premises of impli- cations in I. If there is a lower cover X of P in ΦI that is not closed, then there is a set A ⊂ X in ΦI such that A → A00 holds in the context. If A00 ⊆ P , then P is not minimal. If A00 6⊆ P , then P is not a pseudo-intent. Both cases lead to contradictions so all the lower covers of P are closed and P is recognizable from I. Since we have I ⊂ B, the set of pseudo-intents that are not a premise of I is not empty, so it has minimal elements. Hence, the set of pseudo-intents that are recognizable from I is not empty if I ⊂ B. We have that, even though some unknown pseudo-intents may not be recog- nizable, there are always additional recognizable pseudo-intents for any subset of the Duquenne-Guigues basis. This ensures that we are able to enumerate all the pseudo-intents with an algorithm that finds recognizable ones. Proposition 3. Let A and B be two elements of ΦI . The set B is a lower cover of A if and only if B = A \ C with C a minimal transversal of GenI (A). Proof. Let C be minimal such that ∀G ∈ GenI (A), G∩C 6= ∅. For any attribute i ∈ A \ C, the set (A \ C) ∪ {i} = A \ (C \ {i}) contains a minimal generator of A because of the minimality of C. Therefore, any B between A \ C and A is such that I(B) = A and cannot be in ΦI . Hence, A \ C is a lower cover of A. Let B be a lower cover of A in ΦI . By definition, I(B ∪ {i}) = A for any attribute i ∈ A \ B so there is a subset C of A such that i ∈ C and (C \ {i}) ⊆ B that is a minimal generator of A. This proposition states that the lower covers of a set can be computed from and depend on its minimal generators. As such, the number of lower covers can be exponential in the number of minimal generators which can itself be exponential in the size of the attribute set [14]. Definition 2. For any two attribute sets A and B, we say that B is reachable from A if and only if B ⊂ A and there is an ordering x1 < x2 < ... < xn of the elements of A \ B such that, for every m < n, A \ {x1 , x2 , ..., xm } is not closed under I. In other words, B is reachable from A if you can obtain B by removing at- tributes from A and only go through sets that are not closed under I. Evidently, minimal reachable sets are elements of ΦI . Proposition 4. An attribute set A ∈ ΦI is a minimal recognizable pseudo- intent if and only if A is not an intent and all the minimal attributes sets reach- able from A are intents. 
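Propositions 3 and 4 reduce the search for lower covers and reachable sets to minimal transversal computations. As a brute-force stand-in for the nextTrans black box of Section 2.4 (our own sketch, exponential and only meant for small attribute sets), the following reproduces the Tr({ace, bce, cde}) = {c, e, abd} example and derives lower covers as in Proposition 3.

```python
# Naive enumeration of minimal transversals (hitting sets); not the paper's algorithm.
from itertools import combinations

def minimal_transversals(hyperedges, ground):
    """All inclusion-minimal subsets of `ground` meeting every hyperedge."""
    edges = [frozenset(e) for e in hyperedges]
    found = []
    for k in range(len(ground) + 1):
        for cand in map(frozenset, combinations(sorted(ground), k)):
            if all(cand & e for e in edges) and not any(t <= cand for t in found):
                found.append(cand)
    return found

# Example from Section 2.4: Tr({ace, bce, cde}) = {c, e, abd}
print(minimal_transversals([{"a","c","e"}, {"b","c","e"}, {"c","d","e"}],
                           {"a", "b", "c", "d", "e"}))

# Proposition 3: the lower covers of A in PhiI are the sets A \ C,
# with C a minimal transversal of the minimal generators GenI(A).
def lower_covers(A, minimal_generators_of_A):
    return [set(A) - set(C)
            for C in minimal_transversals(minimal_generators_of_A, set(A))]
```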
Algorithm 1 reachableSets(A)
Require: An attribute set A, an implication set I
1: R = ∅
2: for every i ∈ A do
3:   G = ⋃{GenI(P) | P ∈ PI(A \ {i})}
4:   C = the first transversal of G
5:   while C ≠ ∅ do
6:     B = A \ ({i} ∪ C)
7:     if PI(B) ≠ ∅ then
8:       R = R ∪ reachableSets(B)
9:     else
10:      R = R ∪ {B}
11:    end if
12:    C = nextTrans(C, G)
13:  end while
14: end for
15: Return R

Proof (of Proposition 4). If A is a minimal recognizable pseudo-intent, then every subset of A in ΦI is an intent. Thus every minimal reachable set is an intent. Let us suppose that every minimal reachable subset of A is an intent and A ≠ A″. For any set X reachable from A and any attribute set P ⊆ X, if P″ ⊄ A, then X cannot be an intent. Every minimal reachable subset of A being an intent, A contains the closure of all its subsets, so it is a pseudo-intent. If P is a pseudo-intent that is not in I, subsets of A \ {i} with i ∈ P″ \ P cannot be intents, so either P or a superset of P in ΦI that is not an intent is reachable from A. Hence, A is a minimal recognizable pseudo-intent.

A minimal recognizable pseudo-intent can therefore be recognized by computing the set of its minimal reachable subsets. These minimal reachable sets can be computed using Algorithm 1. By definition, reachable sets can be computed by removing from A attributes that play a role in the logical closure, i.e. attributes in minimal generators of pseudo-intents P ⊂ A such that P″ ⊈ A \ X for some A \ X reachable from A. The minimal transversals T make up all the possible ways to remove minimal generators of pseudo-intents and, thus, all the sets T such that, for every Y ⊂ T, A \ Y cannot be logically closed. As such, removing a minimal transversal of the generators of the pseudo-intents used in the logical closure of a set produces a reachable set. The set A \ T is not always logically closed, so the algorithm is applied recursively until an element of ΦI is reached.

The algorithm is not optimal, as some reachable sets can be computed multiple times, and the sets reachable from a set A that is not an element of ΦI could be computed directly from PI(A) instead of its direct subsets. However, it is sufficient to give us an upper bound on the runtime. For each recursive call, the algorithm computes PI(A \ {i}) for every attribute i ∈ A. For each of these attributes, it then computes the minimal transversals of ⋃{GenI(P) | P ∈ PI(A \ {i})}. There is another recursive call for each pseudo-intent found missing, so the runtime is bounded by O(|A| × |I| × T), where T is the complexity of computing the transversals.

In the rest of this paper, we will suppose that we have an algorithm nextReach that enumerates minimal reachable sets in the same fashion as nextTrans.

4 Computing a Recognizable Pseudo-Intent

4.1 Algorithm

We want to incrementally compute the pseudo-intents of a formal context. In the beginning, we only have the formal context (O, A, R) and an empty set of implications I. The corresponding ΦI is the Boolean lattice of subsets of A. In order to find a first pseudo-intent, we can look for a recognizable set (Proposition 1) in ΦI. We know that a non-closed set contains a pseudo-intent. Thus, we can start with A and recursively remove attributes as long as we obtain non-closed sets. When we can no longer remove an attribute without obtaining an intent, the current set is a recognizable pseudo-intent and we have a first implication.
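As a rough illustration of this initial descent, here is a sketch of our own for the case I = ∅, assuming a closure function (·)″ such as the one sketched in Section 2; once I grows, the actual algorithms below rely on Proposition 3 rather than plain attribute removal.

```python
def first_pseudo_intent(attributes, closure):
    """Descend from the full attribute set through non-closed sets only.
    Returns a recognizable pseudo-intent, or None if the start set is already closed."""
    current = set(attributes)
    descended = True
    while descended:
        descended = False
        for a in sorted(current):
            smaller = current - {a}
            if closure(smaller) != smaller:   # keep descending while sets stay non-closed
                current = smaller
                descended = True
                break
    # at this point every one-attribute removal yields an intent
    return current if closure(current) != current else None
```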
This corresponds to the algorithm presented in [6] that computes a first pseudo-intent. We can then generalize the algorithm for other pseudo-intents. As I contains a new implication, sets disappear from ΦI and we now have to look in the new lattice for a second pseudo-intent. As it is not a Boolean lattice anymore, we cannot simply remove attributes to obtain lower covers so we have to use Proposition 3 to compute them. By going through the lattice using non-closed lower covers of sets until we find a set respecting Proposition 1, we can compute a second pseudo-intent. Pseudo-intents can thus be computed incrementally until A no longer has any non-closed lower cover. Algorithm 2, given a context, a subset I of the Duquenne-Guigue Basis and an attribute set A ∈ ΦI , uses this method to compute a new pseudo-intent P ⊆ A. Proposition 5. nextP I(A, I) returns either A or a pseudo-intent that is not in I. Proof. Let X be a set in ΦI . If X 00 6= X, then there is a pseudo-intent P ⊆ X. If a lower cover of X is not closed, then there is a pseudo-intent X ⊂ S. The algorithm returns the first set it finds that has all its lower covers closed. If this set is not A, then it is not closed and is thus a pseudo-intent. Algorithm 3 uses Algorithm 2 to enumerate pseudo-intents iteratively until it returns A. It is sound but not complete, as exemplified below. Depth-first Search for Pseudo-intents through Minimal Transversals 9 Algorithm 2 nextP I(A, I) Require: Attribute set A, Implication set I 1: P = A 2: G = GenI (A) 3: C = The first transversal of G 4: while C 6= ∅ do 5: S =A\C 6: if S 00 6= S then 7: P = nextP I(S) 8: C=∅ 9: else 10: C = nextT rans(C, G) 11: end if 12: end while 13: RETURN P Algorithm 3 allP I 1: I = ∅ 2: A = nextP I(A, I) 3: while A 6= A do 4: I = I ∪ {A → A00 } 5: A = nextP I(A, I) 6: end while 7: Return I 4.2 Example Let us consider an arbitrary context for which the Duquenne-Guigues basis is {de → bde, bc → bcd, ac → abcd, be → bde, bcde → abcde}. We want to find its pseudo-intents using Algorithm 3 : The set of implications is initially empty. We start with abcde. It has a single minimal generator, itself. The first minimal transversal of Gen(abcde) is a. The set abcde \ a = bcde is not an intent so we continue recur- sively with it. The set bcde has a single minimal generator, itself. The first minimal transver- sal of Gen(bcde) is b. The set bcde \ b = cde is not an intent so we continue recursively with it. The set cde has a single minimal generator, itself. The first minimal transver- sal of Gen(cde) is c. The set cde\c = de is not an intent so we continue recursively with it. The set de has a single minimal generator, itself. The first minimal transversal of Gen(de) is d. The set de \ d = e is an intent. The second and last minimal transversal is e. The set de \ e = d is an intent. The algorithm returns de as the first pseudo-intent which gives us I = {de → bde}. 10 Alexandre Bazin Fig. 3. The five steps of the algorithm. In red the sets considered by the recursive calls, in black the lower covers of those sets in ΦI . We start again with abcde. It has a single minimal generator, acde. The first minimal transversal of Gen(abcde) is a. The set abcde \ a = bcde is not an intent so we continue recursively with it. The set bcde has a single minimal generator, cde. The first minimal transver- sal of Gen(cde) is c. The set bcde \ c = bde is an intent. The second minimal transversal is d. The set bcde \ d = bce is not an intent so we continue recursively with it. 
The set bce has a single minimal generator, itself. The first minimal transver- sal of Gen(bce) is b. The set bce \ b = ce is an intent. The second minimal transversal is c. The set bce \ c = be is not an intent so we continue recursively with it. The set be has a single minimal generator, itself. The first minimal transversal of Gen(be) is b. The set be \ b = e is an intent. The second and last minimal transversal is e. The set be \ e = b is an intent. The algorithm returns be as the second pseudo-intent which gives us I = {de → bde, be → bde}. We start again with abcde. It has two minimal generators, acde and abce. The first minimal transversal of Gen(abcde) is a. The set abcde \ a = bcde is not an intent so we continue recursively with it. The set bcde has two minimal generators, bce and cde. The first minimal transversal of Gen(cde) is c. The set bcde \ c = bde is an intent. The second Depth-first Search for Pseudo-intents through Minimal Transversals 11 minimal transversal is bd. The set bcde \ bd = ce is an intent. The third and last minimal transversal is e. The set bcde \ e = bcd is an intent. The algorithm returns bcde as the third pseudo-intent which gives us I = {de → bde, be → bde, bcde → abcde}. We start again with abcde. It has two minimal generators, cde and bce. The first minimal transversal of Gen(abcde) is bd. The set abcde \ bd = ace is not an intent so we continue recursively with it. The set ace has a single minimal generator, itself. The first minimal transver- sal of Gen(ace) is a. The set ace \ a = ce is an intent. The second minimal transversal is c. The set ace \ c = ae is an intent. The third minimal transversal is e. The set ace \ e = ac is not an intent so we continue recursively with it. The set ac has a single minimal generator, itself. The first minimal transversal of Gen(ace) is a. The set ac \ a = c is an intent. The second and last minimal transversal is c. The set ac\c = a is intent. The algorithm returns ac as the fourth pseudo-intent which gives us I = {de → bde, be → bde, bcde → abcde, ac → abcd}. We start again with abcde. It has three minimal generators, ace, cde and bce. The first minimal transversal of Gen(abcde) is c. The set abcde \ c = abde is an intent. The second minimal transversal is e. The set abcde \ e = abcd is an intent. The third and last minimal transversal is abd. The set abcde \ abd = ce is an intent. The algorithm ends. The algorithm has found four pseudo-intents but is unable to find the set bc when considering the minimal transversals, and thus the sets, in that partic- ular order. Indeed, knowing be → bde and de → bde makes bcde recognizable and blocks every path from abcde to bc in ΦI . Finding bc before either de or be solves the problem. 4.3 Complexity The nature of Algorithm 2 makes it easy to bound the delay between finding two pseudo-intents. As the algorithm starts from A and removes attributes until it ends, it performs a maximum of |A| recursive calls before finding a pseudo- intent. In a call, three different tasks are performed : computing the closure of a set, computing the set of minimal generators and computing the set of all lower covers. The closure of a set is known to be computable in polynomial time. Computing GenI (A) is done in time polynomial in the size of the output. The computation of all the lower covers of A, as we have seen in Section 2.4, can be performed in quasi-polynomial total time, i.e. 
in time quasi-polynomial in the size of n = |Gen I (A)| and m = |T r(GenI (A))| (note that the size of m itself is n bounded by n [18]). Hence, the delay is in O(|A| × (C + G + T )) with C 2 the complexity of computing a closure, G the complexity of computing minimal generators and T the complexity of computing the lower covers. 12 Alexandre Bazin Algorithm 4 nextM inP I(A, I) Require: Attribute set A, Implication set I 1: P = A 2: C =The first minimal set reachable from A 3: while C 6= ∅ do 4: if C 00 6= C then 5: P = nextM inP I(C) 6: C=∅ 7: else 8: C = nextReach(C, A) 9: end if 10: end while 11: RETURN P An interesting point of detail is that the algorithm does not necessarily enu- merate all the intents. When all the upper covers of an intent A in ΦI are intents themselves, A cannot be reached by the algorithm anymore. In particu- lar, if A ∪ {i} = (A ∪ {i})00 for every i ∈ A \ A, we are sure that A will never be considered. In the previous example (see Figure 3), we see that the intents abd, ab, ad, bd, de and ∅ are never considered because they do not appear as lower covers of sets. We suppose that this property helps reduce the runtime on dense contexts with a high number of intents. 5 Computing a Minimal Recognizable Pseudo-Intent 5.1 Algorithm The problem with Algorithm 3 is that there are some pseudo-intents it cannot compute. Among those leftover pseudo-intents, there is necessarily at least one that is minimal. We have shown in Proposition 4 that a minimal recognizable pseudo-intent P can be recognized through the minimal sets reachable from P . Those reachable sets can themselves be obtained by computing the transversals of the sets of minimal generators of the pseudo-intents involved in the logical closure of the different P \ {i} (Algorithm 1). Algorithm 4 uses these properties to compute a minimal recognizable pseudo-intent. In a manner similar to that of Algorithm 2, the algorithm starts with A and enumerates the minimal reachable sets. Once one that is not an intent is found, the algorithm is recursively applied. The algorithm ends when it finds a set with only intents as minimal reachable sets. The main difference between Algorithm 2 and Algorithm 4 is that the former performs a depth-first search in the lattice ΦI while the latter performs it in the directed graph in which the nodes are the elements of ΦI and the edges are the pairs in the “reachable” relation. Proposition 6. Algorithm 4 returns either A or a minimal pseudo-intent that is not in I. Depth-first Search for Pseudo-intents through Minimal Transversals 13 Algorithm 5 allP I2 1: I = ∅ 2: A = nextminP I(A, I) 3: while A 6= A do 4: I = I ∪ {A → A00 } 5: A = nextM inP I(A, I) 6: end while 7: Return I Proof. From Proposition 4 we know that if every minimal set T ∈ ΦI reachable from A ∈ ΦI is closed and A is not, then A is a minimal recognizable pseudo- intent. If T ∈ ΦI is not closed, then T contains a minimal recognizable pseudo- intent so the algorithm can be recursively called on T . The algorithm stops when it encounters a set A with only intents as minimal reachable sets. The set A being the only intent on which the algorithm is called, it returns either A or a minimal recognizable pseudo-intent. We can enumerate all the pseudo-intents with Algorithm 5 using Algorithm 4 in the same fashion as Algorithm 3. Proposition 7. Algorithm 5 is sound and complete. Proof. Soundness follows from Proposition 6. 
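Algorithm 4 is driven by the enumeration of minimal reachable sets (the nextReach black box). As a naive, exponential stand-in, they can be enumerated directly from Definition 2; below is a sketch of our own, assuming a predicate that tests closedness under I(·).

```python
def minimal_reachable_sets(A, logically_closed):
    """Inclusion-minimal sets reachable from A (Definition 2);
    logically_closed(S) should return True iff I(S) = S."""
    reached = set()

    def descend(S):
        for x in S:
            T = frozenset(S - {x})
            if T in reached:
                continue
            reached.add(T)
            if not logically_closed(T):   # only non-closed sets may be passed through
                descend(set(T))

    descend(set(A))
    return [set(S) for S in reached if not any(T < S for T in reached)]
```

The minimal sets returned are necessarily closed under I(·), matching the remark after Definition 2 that minimal reachable sets are elements of ΦI.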
For any implication set I ⊂ B, pseudo-intent P not used in I and attribute set A ∈ ΦI such that P ⊂ A, no set X such that P ⊂ X and P 00 6⊆ X can be an intent so either P or one of its non-closed supersets is reachable from A. Hence, the algorithm does not stop until every pseudo-intent has been found. 5.2 Example Contrary to Algorithm 3, Algorithm 5 is complete. We can either use it to compute all the pseudo-intents or combine it with Algorithm 3. Let us suppose we have used Algorithm 3, as exemplified in Section 4, and that we know the set of implications I = {de → bde, be → bde, bcde → abcde, ac → abcd}. Using Algorithm 5 from there gives us the following : We start with abcde. The first minimal reachable set abcde\(a∪bd) = ce is an intent. The second minimal reachable set abcde \ (a ∪ bc) = de is an intent. The third minimal reachable set abcde\(a∪e) = bcd is an intent. The fourth minimal reachable set abcde \ (b ∪ cd) = ae is an intent. The fifth minimal reachable set abcde\(b∪e) = cd is an intent. The sixth minimal reachable set abcde\(b∪ce) = ad is an intent. The seventh minimal reachable set abcde \ c = abde is an intent. The eighth minimal reachable set abcde \ (d ∪ ab) = ce is an intent. The ninth minimal reachable set abcde \ (d ∪ ae) = bc is not an intent so we continue recursively with it. 14 Alexandre Bazin The set bc contains no pseudo-intents so its minimal reachable sets are bc\b = c and bc \ c = b. They are both intents so the algorithm returns bc as a new pseudo-intent which gives us I = {de → bde, be → bde, bcde → abcde, ac → abcd, bc → bcd}. We start again with abcde. The eight first minimal reachable sets computed are the same as in the previous step. The ninth reachable set abcde \ cde = ab is an intent. The tenth and last reachable set abcde \ e = abcd is an intent. The algorithm ends. In this example, the pseudo-intent bc has been found after bcde. Using both Algorithm 3 and Algorithm 5, it is thus possible to enumerate pseudo-intents in orders that do not extend the inclusion. 5.3 Complexity Once again, the delay between two pseudo-intents with Algorithm 5 is easy to bound. Algorithm 4 performs at most |A| recursive calls before returning a pseudo-intent or A. Each call requires the computation of, at most, all the reachable sets in ΦI . We have shown that this can be done in O(|A| × |I| × T ) with T the complexity of computing minimal transversals of minimal generators of pseudo-intents. Thus, the worst case delay of Algorithm 5 is |I| times that of Algorithm 3. A first minimal transversal can be computed in polynomial time. As such, for any attribute set A, we can compute its first reachable set in ΦI in time polynomial in |A| and |I|. Hence, if, for every A ∈ ΦI that is not a pseudo-intent, the reachable sets in ΦI are ordered in such a way that the first one is not closed, Algorithm 5 can compute (but not necessarily recognize) a pseudo-intent in time polynomial in the size of the implication set. A minimal pseudo-intent P , by definition, does not contain any other pseudo-intent so the set of its reachable set is {P \ {p} | p ∈ P } which can be computed in polynomial time. Thus, there are always orderings of reachable sets such that Algorithm 5 can compute minimal pseudo-intents with an incremental polynomial time between two of them. However, as shown in [6], minimal pseudo-intents cannot be enumerated in output polynomial time. 
Indeed, even though it is possible to compute a minimal pseudo-intent in time polynomial in the size of the implication set, confirming that they have all been found is exponential because non-minimal pseudo-intents and A have a potentially exponential number of reachable sets. 6 Conclusion We have proposed two algorithms for computing pseudo-intents using the com- putation of minimal transversals as their main operation. The first can find pseudo-intents before some of their subsets that are themselves pseudo-intents but it is not complete for some enumeration orders. The second is complete but can only enumerate in orders that respect the inclusion relation. They can be Depth-first Search for Pseudo-intents through Minimal Transversals 15 combined to obtain the enumeration order of the first with the completeness of the second. We expressed their delay as a function of the complexity of com- puting minimal transversals. We showed that, for some orderings of attribute sets, pseudo-intents can be reached, but not recognized, in incremental polyno- mial time. As both reaching and recognizing pseudo-intents, in our context, are related to the problem of computing minimal transversals in hypergraphs for which many special cases are known, we believe that this work may be useful in isolating more cases for which the enumeration of pseudo-intents is easier. On the practical side, the fact that the algorithms do not necessarily enu- merate the entirety of the set of intents should significantly reduce the runtime on dense contexts. Furthermore, the depth-first strategy used in the algorithms allows for some computations to be saved and reused. Further algorithmic opti- mizations and empirical comparisons with other algorithms for the same problem will be the subject of future investigations. References 1. Mikhail A. Babin and Sergei O. Kuznetsov. Recognizing Pseudo-intents is coNP- complete. In Proceedings of the 7th International Conference on Concept Lattices and Their Applications, Sevilla, Spain, October 19-21, 2010, pages 294–301, 2010. 2. Mikhail A. Babin and Sergei O. Kuznetsov. Computing Premises of a Minimal Cover of Functional Dependencies is Intractable. Discrete Applied Mathematics, 161(6):742–749, 2013. 3. Konstantin Bazhanov and Sergei A. Obiedkov. Comparing Performance of Algo- rithms for Generating the Duquenne-Guigues Basis. In Proceedings of The Eighth International Conference on Concept Lattices and Their Applications, Nancy, France, October 17-20, 2011, pages 43–57, 2011. 4. Alexandre Bazin and Jean-Gabriel Ganascia. Enumerating Pseudo-Intents in a Partial Order. In Proceedings of the Tenth International Conference on Concept Lattices and Their Applications, La Rochelle, France, October 15-18, 2013., pages 45–56, 2013. 5. Claude Berge. Hypergraphs: Combinatorics of Finite Sets, volume 45. Elsevier, 1984. 6. Felix Distel and Baris Sertkaya. On the Complexity of Enumerating Pseudo-intents. Discrete Applied Mathematics, 159(6):450–466, 2011. 7. Thomas Eiter and Georg Gottlob. Identifying the Minimal Transversals of a Hy- pergraph and Related Problems. SIAM J. Comput., 24(6):1278–1304, 1995. 8. Thomas Eiter, Georg Gottlob, and Kazuhisa Makino. New Results on Mono- tone Dualization and Generating Hypergraph Transversals. SIAM J. Comput., 32(2):514–537, 2003. 9. Thomas Eiter, Kazuhisa Makino, and Georg Gottlob. Computational As- pects of Monotone Dualization: A Brief Survey. Discrete Applied Mathematics, 156(11):2035–2049, 2008. 10. Michael L. Fredman and Leonid Khachiyan. 
On the Complexity of Dualization of Monotone Disjunctive Normal Forms. J. Algorithms, 21(3):618–628, 1996. 11. Bernhard Ganter. Two Basic Algorithms in Concept Analysis. In Preprint Tech- nische Hochschule Darmstadt, 1984. 16 Alexandre Bazin 12. Jean-Louis Guigues and Vincent Duquenne. Familles Minimales d’Implications Informatives Résultant d’un Tableau de Données Binaires. Mathématiques et Sci- ences humaines, 95:5–18, 1986. 13. Vladimir Gurvich and Leonid Khachiyan. On Generating the Irredundant Con- junctive and Disjunctive Normal Forms of Monotone Boolean Functions. Discrete Applied Mathematics, 96-97:363–373, 1999. 14. Miki Hermann and Baris Sertkaya. On the Complexity of Computing Generators of Closed Sets. In Formal Concept Analysis, 6th International Conference, ICFCA 2008, Montreal, Canada, February 25-28, 2008, Proceedings, pages 158–168, 2008. 15. Sergei O. Kuznetsov. On the Intractability of Computing the Duquenne-Guigues Bas. J. UCS, 10(8):927–933, 2004. 16. Sergei O. Kuznetsov and Sergei A. Obiedkov. Counting Pseudo-intents and #P- completeness. In Formal Concept Analysis, 4th International Conference, ICFCA 2006, Dresden, Germany, February 13-17, 2006, Proceedings, pages 306–308, 2006. 17. Sergei O. Kuznetsov and Sergei A. Obiedkov. Some Decision and Counting Prob- lems of the Duquenne-Guigues Basis of Implications. Discrete Applied Mathemat- ics, 156(11):1994–2003, 2008. 18. Zbigniew Lonc and Miroslaw Truszczynski. On the Number of Minimal Transver- sals in 3-uniform Hypergraphs. Discrete Mathematics, 308(16):3668–3687, 2008. 19. Sebastian Rudolph. Some Notes on Pseudo-closed Sets. In Formal Concept Analy- sis, 5th International Conference, ICFCA 2007, Clermont-Ferrand, France, Febru- ary 12-16, 2007, Proceedings, pages 151–165, 2007. Edit Operations on Lattices for MDL-based Pattern Summarization Keisuke Otaki1,2 and Akihiro Yamamoto1 1 Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Japan ootaki@iip.ist.i.kyoto-u.ac.jp, akihiro@i.kyoto-u.ac.jp 2 Research Fellow of the Japan Society for the Promotion of Science Abstract. The closedness, a fundamental conception in FCA, is also im- portant to address the pattern redundancy problem in pattern mining, where some enumerated patterns subsume others and they are redun- dant. Although many efficient mining algorithms have been proposed, finding characteristic and easy to interpret subsets of the whole enumer- ated patterns, called the pattern summarization problem, is still chal- lenging. A well-used Information Theory-based criterion; the Minimum Description Length (MDL) principle helps us to tackle the problem, but it requires us to design the MDL evaluation from scratch according to types of patterns. In this paper we propose a new framework applicable to various patterns using lattices, which are beneficial to formulate the MDL evaluation. A key idea is revising an existing model by using edit operations defined on lattices among concepts on them, which enables us to consider additional information such as background knowledge and helps us to design the MDL evaluation. We experiment our method to see that our proposal helps us to obtain informative results from the whole sets, and confirm that our model is applicable to various patterns. 
Keywords: formal concept analysis, lattice structures, edit operations, pattern summarization, minimum description length principle 1 Introduction Knowledge Discovery using binary feature vectors is an important subject in data mining, where a binary value represents whether or not an object keeps some feature. A way to construct such vectors is using pattern mining: For a datum d and an enumerated pattern p, we compute a binary value bd,p = 1 if and only if p occurs in d. Since the frequent itemset mining problem was first proposed by Agrawal and Srikant [2], various patterns and algorithms have been studied [1]. We can exploit obtained vectors as representations of data within several Machine Learning methods, and tasks such as clustering and classification can be tackled with the vectors. Therefore pattern mining is an essential tool. Although many efficient mining algorithms have been studied, the pattern redundancy problem is still inevitable, where some patterns subsume others, enu- merated patterns are not independent, and obtained vectors contain redundant c 2015 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Local Proceedings also appeared in ISBN 978-84-606-7410-8, Universidad de Málaga (Dept. Matemática Aplicada), Spain. 18 Keisuke Otaki and Akihiro Yamamoto values. Considering the closedness, a main conception in FCA, is a fundamental way to resolve the problem and an example of the closedness used in pattern mining is closed itemsets [12], known as compact representations of itemsets. The redundancy problem, however, still remains after considering the closed- ness only. To select a subset of the whole set of enumerated patterns, the Min- imum Description Length (MDL) principle has attracted much attention [7, 8, 14–16, 18]. An important idea is to represent databases with codes in a loss-less manner, and to evaluate its difficulty by the length of codes used. For example, the Krimp algorithm [18] is a fundamental method for itemsets. Other methods for sequences and large graphs can be seen in [16] and [8], respectively. According to patterns, we need to design an evaluation method from scratch and it makes theoretical analyses difficult. It is also hard to integrate MDL-based methods with knowledge resources although various side information (e.g., corpora, on- tology, etc.) are helpful to select enumerated patterns. Developing methods applicable to various patterns in a unified framework and possible to consider side information is, therefore, an important but remaining problem. We in this paper present preliminary results of our new lattice-based framework based on an existing MDL-based method to enhance its utility for further developing of such methods to overcome problems above. Our key ideas are introducing edit operations on lattices (in Section 3.1) for the loss-less rep- resentation (by Proposition 1, Lemmata 1 and 2), and evaluating its difficulty by extending an existing formulation (in Section 3.2). To investigate the appli- cation possibility of our framework, we apply it to pattern concept lattices (in Section 4). We experiment them using common benchmark datasets (in Sec- tion 5). Our contributions are summarized below. – A new lattice-based framework ; it is extended from an existing model (code- tables in Section 2.2) by using terminologies from lattices. – The equivalency for closed itemsets (by Proposition 1). 
– Pattern structured-based generalization; GenSet, a generalization of itemsets using pattern structures (in Section 4) as an example of other lattices. – Empirical evaluation using some benchmark datasets (in Section 5). 2 Preliminary We outline Formal Concept Analysis (FCA) and Frequent Closed Itemset Mining (FCIM) for our new lattice-based framework, summarized in Table 1. We explain an idea of the MDL-based algorithm for itemsets, named the Krimp algorithm. 2.1 Basic Concepts Formal Concept Analysis [5] Let K = (G, M, I) be a context, where G and M are sets of objects and attributes, and I ⊆ G × M is a binary relation. A Galois connection between G and M , denoted shortly by (·)0 , is defined as A0 = {m ∈ M | (g, m) ∈ I for all g ∈ A} for A ⊆ G and B 0 = {g ∈ G | (g, m) ∈ Edit Operations on Lattices for MDL-based Pattern Summarization 19 Table 1: The correspondence between FCA and FCIM. FCA [5] FCIM with a parameter θ [12] objects G; a set of objects D; a set of transactions attributes M ; attributes I; items functions {(·)0 , (·)0 } {f, g} task find all concepts find all itemsets I ⊆ I satisfying freq(I) ≥ θ I for all m ∈ B} for B ⊆ M , respectively. A pair (A, B) is a concept if A0 = B and B 0 = A, where A and B are called the extent and the intent, and we denote them by A = Ext(c) and B = Int(c), respectively. For an object g ∈ G, the concept ({g}00 , {g}0 ) is the object concept of g, denoted by γg. Computing all concepts is an important task in FCA, and we denote the set of all concepts by B(K). For two concepts (A, B) and (C, D), the partial order is introduced by A ⊆ C, and hB(K), i becomes a complete lattice. We represent it by B if it is understood from the context. For a concept c ∈ B, we define the upset of c, denoted ↑ c, by {d ∈ B | c d}3 . In addition we define the direct upset ⇑ c by {d ∈↑ c | there exists no e ∈↑ c s.t. e d}. Frequent Closed Itemset Mining [12] Let I be the set of items. A set I ⊆ I of items is called an itemset, and it is also called a transaction. A database D = {t1 , . . . , tN } is a set of transactions. The frequency freq(I) of an itemset I is defined as freq(I) = |{t ∈ D | I ⊆ t}|. For a parameter θ, the frequent itemset mining (FIM) problem is the problem to find all itemsets satisfying freq(I) ≥ θ. We denote the set of all frequent itemsets by F. For closed itemsets, two functions f and g are defined as f (T ) = {i ∈ I | ∀t ∈ T, i ∈ t} for T ⊆ D and g(I) = {t ∈ D | ∀i ∈ I, i ∈ t} for I ⊆ I. An itemset I is closed if and only if f (g(I)) = I. The frequent closed itemset mining (FCIM) problem is to find all frequent and closed itemsets from D with respect to θ. It is easy to see that two functions f and g work in the same manner of (·)0 in FCA4 . We denote the set of all frequent closed itemsets by CF. The closed itemsets aim at reducing the number of all frequent itemsets enumerated in the naı̈ve FIM problem, and many efficient algorithms have been studied [13, 17]. However the number of enumerated itemsets is still large in applications. They are often redundant because some patterns subsume others, which causes the pattern redundancy problem: The enumerated patterns are too large and redundant to exploit them in applications although the closedness is considered. In application, we often need only a small subset that characterizes all the enumerated itemsets, which are easy to interpret. We call such extraction problem pattern summarization (also known as pattern set mining [6, 15, 18]). 
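As a concrete reading of Table 1, here is a small brute-force sketch (our own; real FCIM algorithms are far more efficient) of the closure f(g(·)) and of frequent closed itemset enumeration, run on the toy database that the paper reuses in Example 1 below.

```python
from itertools import combinations

D = [{"1","2","5","6","7","9"}, {"2","3","4","5"}, {"1","2","7","8","9"},
     {"1","7","9"}, {"2","3","7","9"}]      # transactions of Example 1
ITEMS = set().union(*D)

def g(itemset):                 # transactions containing the itemset
    return [t for t in D if itemset <= t]

def f(transactions):            # items common to all those transactions
    return set.intersection(*transactions) if transactions else set(ITEMS)

def frequent_closed_itemsets(theta):
    """All itemsets I with freq(I) >= theta and f(g(I)) = I (exponential scan)."""
    return [I for k in range(len(ITEMS) + 1)
            for I in map(set, combinations(sorted(ITEMS), k))
            if len(g(I)) >= theta and f(g(I)) == I]

print(frequent_closed_itemsets(theta=2))    # e.g. {'1','7','9'} and {'2'} appear
```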
3 This set ↑ c is also called the principal filter in partially ordered sets. 4 We prepare ids of transactions and the set D0 = {(i1 , t1 ), . . . , (iN , tN )}. By con- structing I = {(ik , j) | 1 ≤ k ≤ N, j ∈ tk } from D0 , the FCIM problem of θ = 1 is equivalent to computing all intents of concepts from K = ({i1 , . . . , iN }, I, I). 20 Keisuke Otaki and Akihiro Yamamoto (X, usage, code) 1 3 1 2 4 2 (X, usage, code) Encoded database Encoded database 3 2 3 2345 1 2345 4 1 4 1 2 5 6 7 9 2379 1 2379 179 25 6 5 2 5 2 3 4 5 179 3 179 2345 6 1 6 1 2 7 8 9 25 1 25 179 2 8 7 4 7 2 1 2 8 1 1 7 9 179 8 6 1 6 9 4 9 2 3 7 9 8 1 8 2379 (a) The standard code-table ST (left) and (b) A code-table CT (left) and the cover the cover of D by ST (right). of D by CT (right). Fig. 1: The cover of D = {125679, 2345, 12789, 179, 2379} (from Example 1). 2.2 Two-part MDL Principle and The Krimp Algorithm Out of various criteria for the selection, the two-part MDL principle [7] is often adopted for itemsets, where code-tables are well-studied models. Databases (or contexts) are encoded and represented by (binary) codes using code-tables. Definition 1 (Code-tables [18]). A code-table CT consists of two columns for itemsets and codes. Formally CT = {(X1 , c1 ), (X2 , c2 ), . . . } is a set of pairs, where Xi ⊆ I, ci ∈ C, and C be the set of (binary) codes. For the convenience of our discussion, in the following, we regard code-tables as sets of itemsets: For example X ∈ CT means (X, cX ) ∈ CT and cX is denoted by code CT (X). We define CT − = {(X, cX ) ∈ CT | |X| > 1}. The standard code- table ST consists of all singleton itemsets over I. Let us first introduce examples, where sets are written as sequences (e.g. {1, 7, 9} is represented as 179). Example 1. Let I = {1, . . . , 9} and D = {125679, 2345, 12789, 179, 2379}. In Fig. 1a and 1b, the left table is a code-table and the right one is a database represented by codes, where rectangles represent codes (e.g., the rectangle 7 means the code of the itemset 7). In Fig. 1a, the transaction 12789 is encoded separately. In Fig. 1b (CT = ST ∪{25, 179, 2345, 2379}), it is done by 179∪2∪8. The process of representing databases using code-tables is called the cover. Both frequent itemsets (e.g., 179) and non-frequent ones (e.g., 2345 or 2379) are used and the MDL principle takes a balance among them. The property of the cover of transactions is described by cover functions. Definition 2 (Cover function5 ). Let CT be the set of all possible code-tables over I, D be a database, t ∈ D be a transaction, and CT ∈ CT be a code-table. We call the set of itemsets {X | (X, cX ) ∈ CT } the coding set of CT , denoted by CS . A cover function on CT , cover : CT × 2I → 2CS , should satisfy: C1. if X, Y ∈ cover (CT , t) then it holds either X = Y or X ∩ Y = ∅, and 5 This definition is modified from Definition 2 of [18] by using a coding set CS . Edit Operations on Lattices for MDL-based Pattern Summarization 21 S C2. the union of all X ∈ cover (CT , t) equals to t, i.e., t = X∈cover (CT ,t) X. Code-tables in Fig. 1 contain additional information usage. Using coding methods in the Kolmogorov Complexity [9], codes are decided according to how often itemsets are needed : For X ∈ CT , usage(X) is defined as |{t ∈ D | X ∈ cover (CT , t)}|, which determines the least length of prefix codes required. Example 2 (from Example 1). In Fig. 1a, cover (ST , 179) = {1, 7, 9}, and in Fig. 1b cover (CT , 12789) = {179, 2, 8}. It holds that usage(179) = 3. 
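The cover itself can be sketched as a greedy pass over the code table. The sketch below is our own; it uses the standard cover order recalled later in the paper (cardinality descending, support descending, lexicographically ascending) and assumes, as Krimp does, that all singletons remain in the table so that the cover is loss-less. On the code table of Fig. 1b it covers 12789 as 179, 2, 8, matching Example 2.

```python
def cover(code_table, transaction, support):
    """Greedy cover: pick pairwise disjoint itemsets from the code table
    until the transaction is exhausted (all singletons assumed present)."""
    order = sorted(code_table,
                   key=lambda X: (-len(X), -support(X), tuple(sorted(X))))
    remaining, used = set(transaction), []
    for X in order:
        if X and X <= remaining:        # C1: chosen itemsets are disjoint
            used.append(set(X))
            remaining -= X
    assert not remaining                # C2: their union equals the transaction
    return used

# Example 2 revisited with the code table of Fig. 1b
CT = [frozenset(s) for s in ("2345", "2379", "179", "25")] + \
     [frozenset(i) for i in "123456789"]
D = [set(t) for t in ("125679", "2345", "12789", "179", "2379")]
supp = lambda X: sum(1 for t in D if X <= t)
print(cover(CT, set("12789"), supp))    # [{'1','7','9'}, {'2'}, {'8'}]
```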
If a transaction t ∈ D is encoded by a cover function cover (·, ·) and some CT , we say that t is covered. Then the MDL principle can be evaluated. Definition 3 (Lemma 1, Definitions 4 and 5 in [18]). The two-part MDL principle says that a code-table CT ∈ CT is the best for D if it minimizes L(D, CT ) = L(CT ) + L(D | CT ), where L(CT ) is the length of the code-table CT , and L(D | CT ) is that of D when CT is given. Following Definitions 1 and 2, terms L(CT ) and L(D | CT ) for L(D, CT ) are evaluated as follows: X L(CT ) = L(code ST (X)) + |code CT (X)|, (1) X∈CT if usage(X)6=0 X X L(D | CT ) = |code CT (X)|, (2) t∈D X∈cover (CT ,t) P where L(code ST (X)) = i∈X |code ST ({i})|. The Krimp algorithm [18] is an iterative algorithm updating a code-table CT ∈ CT greedy by adding a new frequent itemset X ∈ F with evaluating L(D, CT ). Please refer to Algorithm 3 in [18], which we follow in experiments. 2.3 Problems Models based on code-tables have been widely used in MDL-based studies. The cover function adopts the set union to represent transactions, and the subset operation to divide them as seen in Example 1. The process assumes that items are in the same granularity, but it is not the case in applications. It also disables us to allow for side information (i.e., our background knowledge and resources) about databases as well. Then our targeting problems are listed below. Problem 1. The cover function has a drawback of the difficulty of its generaliza- tion because it implicitly uses set operations. Problem 2. Code-tables cannot reflect side information of items (e.g., hierarchy, or degree of importance of items) because of implicitly used set operations. That is we need to revise set operations used in the existing method. Our key idea is to adopt terminologies of lattices to overcome these problems. In Section 3, we revise the cover function with lattices to build our new framework. We apply it to pattern concept lattices by pattern structures in Section 4 to confirm that our lattice-based framework is applicable to various patterns. 22 Keisuke Otaki and Akihiro Yamamoto 3 Models by Edit Operations for Closed Itemsets We provide our framework based on lattices. For discussions, we define the object concept corresponding to t ∈ D by γt = (g(t), t) from the relation in Table 1. Here we only consider closed itemsets. 3.1 Cover Using Edit Operations on Lattices In the cover of t ∈ D, t is recursively replaced with t \ B1 , t \ (B1 ∪ B2 ), . . . by choosing mutually disjoint closed itemsets B1 , B2 , . . . until t gets ∅ in some order6 . We now have the following. Observation 1 The cover of t ∈ D can be represented as a sequence of the deletion of items: t 7→ t \ B1 , t \ B1 7→ t \ (B1 ∪ B2 ), . . . , t \ (B1 ∪ · · · ∪ Bm ) 7→ ∅. Thus a transaction t ∈ D can be covered by also using the deletion of items. We use such deletions to construct a new cover based on lattices. On lattices, selections of Bi corresponds to selecting concepts from B, and intents of them can be stored in code-tables. With respect to the search space of the selection, the following holds by properties of lattices. Observation 2 For a transaction t ∈ D, only a sub-lattice corresponding to the upset ↑ γt of the corresponding object concept γt of t should be considered. Choosing concepts, however, is not complete for the loss-less property (i.e., C2 of Definition 2) because in general B does not contain all singleton itemsets in intents of concepts. 
In contrast, by the discussion of closed itemsets [12], all frequent itemsets could be recovered from closed itemsets and lattices should contain all information required. Our idea is extracting such hidden information on lattices as edit operations on lattice (e.g., the deletion of items in Observation 1), and represent t ∈ D by both the selections and edit operations. We first introduce edit information. Definition 4 (Edit Information). For c = (A, B) ∈ B, S the edit information, denoted by Edit(A, B), is defined as Edit(A, B) ≡ B \ (C,D)∈⇑(A,B) D. For a S set C of concepts, it is extended by Edit(C) = (A,B)∈C Edit(A, B). For c ∈ B, Edit(c) indicates items which cannot be represented by intents of concepts except c itself even when all concepts in ↑ c \ {c} are chosen. Example 3 (from Example 2). For t3 = 12789, it holds that γ3 = (3, 12789) and Edit(γ3) = {8}. The item 8 cannot be represented by any concepts except γ3. Therefore if γ3 is not selected the item 8 must be represented by another way. 6 The standard cover order is a basic one from the Krimp algorithm: For an itemset B ⊆ I, the standard cover order is the order of the cardinality |B| descending, the support freq(B) descending, and lexicographically ascending [18] Edit Operations on Lattices for MDL-based Pattern Summarization 23 Edit(1345,79)={7,9} (12345,{}) (12345,{}) Edit(1235,2)={2} Edit(1235,2)={2} (1345,79) (1345,79) Edit(134,179)={1} (1235,2) (1235,2) CT (134,179) (134,179) (135,279) (135,279) (13,1279) (12,25) (25,23) (13,1279) (12,25) (25,23) Edit(γ3)={8} Edit(γ3)={8} (1,125679) (5,2379) (1,125679) (5,2379) (3,12789) (2,2345) (3,12789) (2,2345) ({},123456789) ({},123456789) (a) For t3 = 12789 with ST (b) For t3 = 12789 with CT Fig. 2: Examples of the cover with edit operations for γ3 = (3, 12789) (i.e., t3 ). As another way, edit operations among intents of concepts are introduced to represent such information. Because B contains the top > = (G, ∅)7 and Int(γt) = t, all information for covering t should be found in ↑ γt. We use edit information to capture such information and to explain edit operations on lattices (i.e., on intents of concepts). We encode edit information by tuples called edit scripts, and store them in edit-tables by analogy with code-tables. Definition 5 (Edit scripts and Edit-tables). An edit-table ET consists of three columns: kinds of operations, arguments of operations (itemsets), and codes. That is, ET = {(t1 , X1 , c1 ), . . . , (tm , Xm , cm )}, where ti indicates the type of operations, and both Xi and ci is the same to code-tables. The tuple (t, X, cX ) is called an edit script. Let E be the set of all edit scripts. Example 4 (from Example 3). Because Edit(γ3) = {8}, the item 8 should be represented by the deletion of the item 8. Let DEL represent the deletion 8 . A script (DEL, 8, c8 ) with a code c8 means the deletion of the item 8. Definition 5 is general and abstract. For closed itemsets, however, we only need a specific kind of them. We define a function Encode : 2I → 2E for X ⊆ I by Encode(X) = {(DEL, i, ci ) | i ∈ X, ci ∈ C}, where C is the set of codes. Lemma 1. It holds that Edit(↑ γt) = t for any t ∈ D. Proof If ↑ γt = {γt, >}, it holds that Edit(↑ γt) = Edit(γt) = t. Then Encode(·) generates edit scripts of all items in t as the deletion of an item. By the induction on lattices (by ⇑ c), we only need to confirm that for c ∈ B all items which cannot be covered by intents of concepts in ⇑ c \ {c} can be represented by edit scripts. 
This is done by the definition of Edit(c) and properties of the concept lattice ⟨B(K), ≤⟩. □

Lemma 1 guarantees that any t ⊆ I can be represented by using edit scripts when no concepts are chosen. Now the cover using lattices can be described by a function cover_L(·, ·), by analogy with cover(·, ·). The following describes how to implement it instead of giving a new formal description of the function.

7 If B ≠ ∅ for the top ⊤ = (G, B) of ⟨B(K), ≤⟩, we can insert a dummy (G, ∅).
8 The type is required to introduce several types of edit operations if needed.

Definition 6 (Revised Cover). The function cover_L : CT × 2^I → 2^{2^I} × 2^E receives a code-table CT containing intents of selected concepts and a transaction t, and it returns the set of itemsets used (e.g., 179) and the set of edit scripts. For a code-table CT (see Fig. 2b), we first choose intents of concepts from CT in some order, and denote by C the set of selected concepts, which should be excluded from the search space ↑γt. The remainder should be represented by edit scripts. Then cover_L(CT, t) = ({Int(c) | c ∈ C}, Encode(Edit(↑γt \ ↑C))).

Example 5 (from Example 3). It holds that cover_L(CT, t3) = ({179}, {(DEL, 2, c2), (DEL, 8, c8)}), representing the cover of t3 in Fig. 2b. It holds that cover_L(ST, t3) = (∅, {(DEL, 1, c1), (DEL, 2, c2), (DEL, 7, c7), (DEL, 8, c8), (DEL, 9, c9)}), which simulates the cover of t3 by singleton itemsets (in Fig. 2a).

If the set C in Definition 6 is the empty set ∅, the function cover_L(·, ·) achieves the same result as the function cover(·, ·) with ST, supported by Lemma 1 (see Example 5). If some concepts are selected, the following lemma is important.

Lemma 2. Let C be the set in Definition 6 and (\mathcal{X}, \mathcal{Y}) = cover_L(CT, t). The set \mathcal{Y} is complete to represent t \ \bigcup_{X \in \mathcal{X}} X.

Proof From Definition 6, if a concept c = (A, B) in ↑γt is selected and stored in CT (i.e., c ∈ C), no item i ∈ B is represented by edit scripts in \mathcal{Y}, because there exists no concept d ∈ ↑γt \ ↑C satisfying i ∈ Edit(d). It holds that {i | i ∈ Edit(↑γt \ ↑C)} = t \ \bigcup_{X \in \mathcal{X}} X, and the function Encode(·) ensures the lemma, i.e., the cover by cover_L(·, ·) is valid when some concepts are chosen. □

Together with the completeness in Lemmata 1 and 2, our edit scripts are enough to represent any t ∈ D while selecting concepts in C.

3.2 Model Evaluation Using Edit Operations

For our model (CT, ET), with reference to Section 2.2, the MDL principle L(D, CT, ET) = L(CT, ET) + L(D | CT, ET) can be evaluated. For L(CT, ET),

L(CT, ET) = \sum_{(X, c_X) \in CT,\, usage(X) \neq 0} \big( L(code_{ST}(X)) + |c_X| \big) + \sum_{(t, X, c_X) \in ET,\, usage(X) \neq 0} \big( L(code_{OP}(t)) + L(code_{ST}(X)) + |c_X| \big),   (3)

where code_OP(t) represents the codes used for distinguishing the types of edit scripts. For itemsets, L(code_OP(t)) = 0. The first term of Equation 3 is the same as in the Krimp algorithm, and the second term encodes the edit-table ET. The difficulty of representing databases, L(D | CT, ET), is evaluated by naturally generalizing the corresponding term of the Krimp algorithm below.

[Fig. 3: A code-table T (= T⁻ ∪ ST) obtained by the Krimp algorithm (a), and our model with the two tables (CT, ET) (b). The code-table T lists itemsets with their usages and codes (the itemsets 2345, 2379, 25 with usage 1, 179 with usage 3, and the singletons 2, 6, 8 with usage 1); our model keeps the non-singleton itemsets in CT and replaces the singleton rows by edit scripts such as (DEL, 2), (DEL, 6), and (DEL, 8) in ET.]
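The edit information of Definition 4 and the revised cover of Definition 6 can be prototyped directly on a list of concepts. The sketch below is illustrative only: the lattice is the one of Fig. 2 stored as plain (extent, intent) pairs, the order of choosing concepts is ignored, and the helper names edit_info and cover_L are ours; under these assumptions it reproduces the cover of Example 5. The lengths of Equations 3 and 4 would then be obtained by summing the code lengths of the returned intents and scripts, as in the earlier sketch.

def edit_info(lattice, c):
    """Edit(A, B): items of the intent B that no concept strictly above (A, B)
    can express (Definition 4)."""
    covered = set()
    for d in lattice:
        if c[0] < d[0]:               # strictly larger extent = strictly above
            covered |= d[1]
    return c[1] - covered

def cover_L(lattice, chosen, t):
    """Revised cover of Definition 6: intents of the chosen concepts in the
    up-set of the object concept of t, plus DEL scripts for what remains."""
    gamma_t = next(c for c in lattice if c[1] == t)        # Int(gamma_t) = t
    up = [c for c in lattice if gamma_t[0] <= c[0]]        # up-set of gamma_t
    selected = [c for c in up if c in chosen]
    rest = [c for c in up                                  # up(gamma_t) \ up(C)
            if not any(s[0] <= c[0] for s in selected)]
    remainder = set()
    for c in rest:
        remainder |= edit_info(lattice, c)
    return [c[1] for c in selected], {("DEL", i) for i in remainder}

# Concept lattice of the running example (extents over objects 1..5, intents
# over items 1..9), transcribed from Fig. 2.
L = [("12345", ""), ("1345", "79"), ("1235", "2"), ("134", "179"),
     ("135", "279"), ("13", "1279"), ("12", "25"), ("25", "23"),
     ("1", "125679"), ("5", "2379"), ("3", "12789"), ("2", "2345"),
     ("", "123456789")]
L = [(frozenset(e), frozenset(i)) for e, i in L]

CT = [c for c in L if c[1] == frozenset("179")]            # intent 179 selected
print(cover_L(L, CT, frozenset("12789")))
# -> ([frozenset({'1','7','9'})], {('DEL','2'), ('DEL','8')})  as in Example 5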
Our table CT is equivalent to T⁻ and ET simulates ST.

L(D | CT, ET) = \sum_{t \in D} \Big( \sum_{(X, c_X) \in \mathcal{X}} |c_X| + \sum_{(t, X, c_X) \in \mathcal{Y}} |c_X| \Big)  for (\mathcal{X}, \mathcal{Y}) = cover_L(CT, t).   (4)

We illustrate an example in Fig. 3 to show how edit-tables intuitively work.

3.3 The Equivalency of Models for Closed Itemsets

For closed itemsets, we now have the following proposition.

Proposition 1 Let T be a code-table obtained by the Krimp algorithm for the database D with some order on CF. With the same order, our model can achieve a pair (CT, ET) satisfying L(D, T) = L(D, CT, ET) for closed itemsets.

Proof (Sketch) At the beginning, it holds that T = ST (i.e., T⁻ = ∅) and CT = ∅, where all transactions should be covered by singleton itemsets. From the definition of Encode(·) and Lemma 1, the proposition immediately holds. If a closed itemset B ∈ CF is selected and stored as B ∈ T, there exists a concept (A, B) ∈ B and its intent B is stored in our table CT. Because we use the same order in the cover of t ∈ D, for (\mathcal{X}, \mathcal{Y}) = cover_L(CT, t), it holds that \mathcal{X} = cover(T, t), meaning that the usages in T⁻ and CT must be identical. The usages of edit scripts must be the same as well, by the fact that \mathcal{X} = cover(T, t), Lemmata 1 and 2, and the definition of Encode(·). □

Summary Our idea is to revise the cover in the Krimp algorithm with lattices. Our framework divides code-tables into a part of intents of concepts (i.e., closed itemsets) and a part of edit scripts simulating singleton itemsets. For closed itemsets, our model achieves the same result as that obtained by the Krimp algorithm (Proposition 1), which is supported by properties of lattices.

The advantage of our lattice-based model is that lattices help us to generalize the MDL-based evaluation to various patterns. For example, pattern concept lattices [4] enable us to address the problem for several kinds of patterns with the MDL principle. The key idea of the generalization is to introduce edit operations on lattices according to the patterns at hand and to encode them as edit scripts in edit-tables, where lattices become the search space of our framework. Lattices help us to consider the completeness of edit operations as well.

4 A Generalized Model via PSA

We apply our framework to pattern concept lattices. As an example of other lattices, we consider GenSets, a generalized version of itemsets based on pattern structures, which were introduced to analyze non-binary data [4]. Below, we call methods using them for Knowledge Discovery Pattern Structure Analysis (PSA).

4.1 Pattern Structures and Generalized Sets

Pattern structures extend FCA using meet semi-lattices; a meet semi-lattice (D, ⊓) is a pair of a set D of descriptions, representing fragments of objects as their features, and a meet operator ⊓ between descriptions satisfying associativity, commutativity, and idempotency. A partial order ⊑ on descriptions is induced by ⊓: for x, y ∈ D, x ⊑ y whenever x ⊓ y = x. A pattern structure P is a triple P = (G, (D, ⊓), δ), where G is a set of objects, (D, ⊓) is a meet semi-lattice, and a mapping δ : G → D gives a description for each object. In PSA we use the following Galois connection (·)□: A□ = ⊓_{g ∈ A} δ(g) for A ⊆ G, and d□ = {g ∈ G | d ⊑ δ(g)} for d ∈ D. A pair (A, d) ∈ 2^G × D is a pattern concept if and only if A□ = d and d□ = A. A partial order among pattern concepts is defined as well, and pattern concept lattices can be constructed in a similar manner to FCA. Recent applications of PSA and lattices constructed by PSA can be seen in [3, 11].
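To see the Galois connection of PSA concretely, here is a minimal sketch for the special case where descriptions are itemsets and ⊓ is set intersection (which recovers ordinary FCA); the toy objects g1, g2, g3 and the function names are illustrative assumptions, not part of the paper's formulation.

from itertools import combinations
from functools import reduce

delta = {                                # delta: object -> description
    "g1": frozenset("179"),
    "g2": frozenset("2379"),
    "g3": frozenset("12789"),
}

def meet(descriptions):
    """Meet of descriptions; here itemset intersection plays the role of the
    similarity operator of the meet semi-lattice."""
    return reduce(frozenset.__and__, descriptions)

def box_ext(A):                          # A -> common description of A
    return meet([delta[g] for g in A])

def box_desc(d):                         # d -> objects whose description subsumes d
    return frozenset(g for g, dg in delta.items() if d <= dg)

def pattern_concepts():
    """Brute force over non-empty object sets (fine for toy data): for any A,
    the pair (box_desc(box_ext(A)), box_ext(A)) is a pattern concept."""
    out = set()
    for r in range(1, len(delta) + 1):
        for A in combinations(delta, r):
            d = box_ext(A)
            out.add((box_desc(d), d))
    return out

for A, d in sorted(pattern_concepts(), key=lambda c: len(c[0])):
    print(sorted(A), "".join(sorted(d)))

The GenSets defined next replace this intersection meet by a meet that additionally records how many further items the arguments may share.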
We consider a slight modification of FCA by giving additional information to itemsets from the following scenario. Example 6 (from Example 1). Let us consider t4 = 179 of id i4 and t5 = 2379 of id i5 . We have (i4 i5 )0 = 79 computed by i04 ∩ i05 = 179 ∩ 2379 = 79. The remainders for each transaction are 1 = 179 \ 79 and 23 = 2379 \ 79, respectively. In this case, the existence of these items (i.e., 1 in t4 and 2, 3 in t5 ) is ignored although they are useful for the representation of itemsets. From this example, by using a meet semi-lattice, we introduce auxiliary num- bers representing the existence of items, i.e., the number of items may be included at most in common, to enhance the expressiveness of itemsets. Definition 7 (GenSets). We define D = 2I × N. For an itemset B ⊆ I, we encode it by a pair (B, 0) ∈ D. The meet operator u is defined as (B1 , n1 ) u (B2 , n2 ) = (B1 ∩ B2 , min(n1 + |B1 \ B1 ∩ B2 |, n2 + |B2 \ B1 ∩ B2 |)). Edit Operations on Lattices for MDL-based Pattern Summarization 27 We call itemsets with additional numbers GenSets (generalized sets of items) 9 . Example 7. For two itemsets t4 = 179, t5 = 2379, we encode them as d1 = (179, 0) and d2 = (2379, 0), respectively. Then d = d1 u d2 is computed as d = (B, n) = (79, 1), where n = 1 means that both itemsets t4 and t5 possibly have further one item. In other words, a GenSet (79, 1) can be regarded as {7, 9, ?} with a wildcard symbol ? representing any item in I. We can interpret GenSets as a richer representation, and conjecture that they are helpful to capture characteristic features of itemsets. In other words, we now consider to take care of both the extent and the intent of concepts in evaluating the MDL principle. However, the Krimp algorithm cannot deal with GenSets directly because the symbol ? is not in I and it has completely different meaning. Therefore, applying the ordinal cover is not suited and a new formulation is required from scratch. In contrast, our lattice-based model can be applicable. We can evaluate the two-part MDL principle for GenSets by only considering the cover function cover L (·, ·) using edit operations including ?. 4.2 The two-part MDL principle for GenSets We can evaluate the two-part MDL principle only by considering the function cover L (·, ·) using edit operations based on pattern concept lattices. Models for GenSets For a GenSet (B, n) ∈ D, we need to encode n to store GenSets in a code-table, which is possible by adding a column for storing the code for n. There exist several codes for numbers (e.g., arithmetic codes). We choose a simple one, which requires the length − log2 (1/P ) bits where P is the number of possible candidates, and n is bounded by the minimum size of transactions in D. Assuming this code, we set P = |I|, the worst length of such representation. The Cover with GenSets For d = (B, n) ∈ D, if n = 0, d is an itemset B. We focus on how to define the cover with GenSets when n > 0. Example 8. Let d = (B, n) = (79, 1) and t3 = 12789. We first cover a part of t3 by the itemset B and get 128 = t3 \ B. Additionally we rewrite a symbol ? (n = 1) for {1, 2, 8}. An item 8 is chosen and code ST (8) is used for this choice because the code of 8 in ST is longer than that of 1 and 2. Please see Fig. 4 as well for the cover of t3 using a GenSet d = (79, 1) ∈ D. As the symbol ? subsumes any items in I we need to evaluate such subsumptions. 
To implement it, we assume that we would like to find general, common features which characterize many transactions, by excluding non-frequent items as long as possible. We implement the subsumption of ? by replacing the symbol ? with the item i ∈ I having the longest code code_ST({i}). Instead of giving a formal description, we give sketches of our idea below.

9 For three sets X, Y, Z ∈ 2^I we want to compute n = min(|X| − |XYZ|, |Y| − |XYZ|, |Z| − |XYZ|), where XYZ = X ∩ Y ∩ Z; this value n can be computed by counting and set operations.

[Fig. 4: For t3 = 12789 with CT = {(79, 1)}, where (79, 1) is a selected GenSet. The pattern concept lattice, with pattern concepts such as (12345, (∅, 3)), (1345, (79, 1)), (1235, (2, 3)), (134, (179, 0)), (135, (279, 1)), (13, (1279, 1)), (12, (25, 2)), (25, (23, 2)), (3, (12789, 0)), (1, (125679, 0)), (5, (2379, 0)), (2, (2345, 0)), is traversed: the GenSet (79, 1) is selected and its additional wildcard item covers the item 8, the item of longest code. We need to find which item other than 7 and 9 the GenSet covers.]

Sketch 1 We prepare a type SUB and construct an edit script as (SUB, i, ci). As we have two types (DEL and SUB), L(code_OP(t)) in Equation 3 is used. The function Encode(·) should be revised so as to search for an item i having the longest code. For example, in Fig. 4, cover_L({(79, 1)}, 12789) should now return ({(79, 1)}, {(DEL, 1, c1), (DEL, 2, c2), (SUB, 8, c8)}) after finding i = 8.

Sketch 2 For d = (B, n), if n > 0 (i.e., the code of n shows a non-zero value), we use a code code_ST({i}) for representing the item i of the longest code in ST, without edit scripts. This is done by extending the function cover_L(·, ·) to a function like cover_L^new({(79, 1)}, 12789) = ({(79, 1)}, {(DEL, 1, c1), (DEL, 2, c2)}, {8}), for example, where {8} can be immediately evaluated by the ST computed from D.

In dealing with pattern concept lattices, many ways of fulfilling the loss-less property are applicable. Because both sketches are based on our edit-tables for itemsets in Section 3 and they adopt prefix codes, they achieve the loss-less property if they have codes for the items subsumed by ?. As they do have such codes, the loss-less property holds for GenSets. In the experiments we use Sketch 2.

Corollary 1 (The Loss-Less Property of GenSets) The loss-less property immediately holds for GenSets from the discussion above.

We stress that lattices are compact representations of objects and their features, and they help us to prepare edit operations (i.e., edit scripts) and the cover function. Furthermore, we can introduce more labels in the subsumption of ? to represent various kinds of information, taking valuable side information into account in pattern mining. In addition, we expect that GenSets can be used not only for the symbol ? but also for representing hierarchies by stacking GenSets as trees.

5 Experiments

From the UCI Machine Learning repository [10], we use iris, breast, pima, and wine (see footnotes 10-13), which are also used as benchmark datasets for the Krimp algorithm [18]. All of these datasets are available from the distribution package of the Krimp algorithm (footnote 14). The statistics of these datasets are summarized in Table 2.

Table 2: Benchmark datasets represented in the form of contexts (G, M, I). Recall that CF is the set of all closed itemsets with θ = 1.
  dataset   objects |G|   attributes |M|   |I|/(|G||M|)   |CF|
  iris      150           19               0.263          163
  breast    699           16               0.624          641
  pima      768           38               0.237          3,204
  wine      178           68               0.206          14,555
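As a concrete reference for what these experiments evaluate, the following minimal sketch implements the GenSet meet of Definition 7 and a Sketch-2-style cover; the standard-code lengths used in the example are made-up values and the function names are ours, not the paper's implementation.

def genset_meet(d1, d2):
    """Meet on GenSets (Definition 7): intersect the itemsets and keep the
    smallest number of further items that may be shared."""
    (B1, n1), (B2, n2) = d1, d2
    B = B1 & B2
    return (B, min(n1 + len(B1 - B), n2 + len(B2 - B)))

def cover_with_genset(d, t, st_code_len):
    """Sketch-2-style cover of a transaction t by a selected GenSet d = (B, n):
    remove B from t, then let each wildcard absorb the remaining item with the
    longest standard code (st_code_len maps an item to |code_ST({i})|)."""
    B, n = d
    rest = set(t) - B
    absorbed = sorted(rest, key=lambda i: st_code_len[i], reverse=True)[:n]
    return B, absorbed, rest - set(absorbed)   # the last part is left for DEL scripts

d = genset_meet((frozenset("179"), 0), (frozenset("2379"), 0))
print(d)                                       # -> (frozenset({'7','9'}), 1), as in Example 7
st_len = {"1": 2.0, "2": 2.0, "7": 1.5, "8": 3.0, "9": 1.5}   # made-up code lengths
print(cover_with_genset(d, frozenset("12789"), st_len))
# the wildcard absorbs item 8; items 1 and 2 remain for edit scripts (Example 8)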
All of our experiments were run on a Mac OS X 10.10.2 machine with 2 × 2.26 GHz quad-core Xeon processors and 64 GB of main memory. All code was written in Python 2.7.9. Our algorithm is the same as Algorithm 3 in [18], which iteratively and greedily updates a code-table, except for the MDL evaluation part. In our framework we replaced it with our implementation of Equations 3 and 4 based on lattices and edit operations. We do not analyze the computational complexity of our framework in detail. Rather than discussing run-times alone, we examine the following points.

– Comparison of the two models: for itemsets, the results of the two models are identical by their equivalence (Proposition 1). We therefore investigate how much computational time our new model adds.
– We analyze the results with GenSets to check the applicability of our lattice-based model to other patterns.

Let CT be the code-table obtained after applying the Krimp algorithm. For a database D we can compute both L(D, ST) and L(D, CT). We then measured the ratio L(D, CT)/L(D, ST) (used in [18]), which can be regarded as the degree of compression achieved by CT; it is denoted Ratio in the results. In addition we measured the size |CT⁻| because it indicates how many concepts are chosen out of the whole set of enumerated patterns. Furthermore, in the experiments with GenSets, we measured the Jaccard index J(X, Y) ≡ |X ∩ Y|/|X ∪ Y| for two sets X and Y, between a set of identifiers of closed itemsets in CF and that of GenSets, to see the difference between the results obtained with itemsets and with GenSets. Note that in our setting, the set of pattern concepts by GenSets can be regarded as a set of closed itemsets if we ignore the additional numbers n in (B, n) ∈ D.

10 https://archive.ics.uci.edu/ml/datasets/Iris
11 https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
12 https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
13 https://archive.ics.uci.edu/ml/datasets/Wine
14 http://www.patternsthatmatter.org/software.php, confirmed Feb. 2015.

Table 3: Results for itemsets by code-tables (the Krimp algorithm) and two tables (concepts and edit-tables in our framework).
  dataset   |CF|     |CT⁻|   Ratio   Krimp Pre-proc. [s]   Krimp Run-time [s]   I-Edit Pre-proc. [s]   I-Edit Run-time [s]
  iris      163      13      0.456   0.025                 0.234                0.021                  0.520
  breast    641      29      0.181   0.580                 6.263                0.573                  26.794
  pima      3,204    66      0.356   18.202                47.525               19.107                 373.63
  wine      14,555   72      0.757   442.686               393.539              444.842                1354.570

In the results, the label Krimp denotes results of the Krimp algorithm, I-Edit denotes results of our framework using lattices and edit operations, and Gen-Edit denotes results for GenSets using pattern concept lattices.

5.1 Results and Discussions for Itemsets

Table 3 shows the results of the experiments for itemsets with the existing method Krimp and our framework I-Edit, where we focus on run-times only. Note that we confirmed, of course, that the results are equivalent. With respect to the pre-processing time, there are no large differences. Concerning the time for the iterative model-updating process, roughly speaking our new model is at most 10 times slower than code-tables. Because neither model is implemented in a sophisticated manner, we conjecture that our implementation does not exploit the pre-processed information on lattices well. That is, computing edit-tables using lattices is time-consuming and not done in a sophisticated way in our implementation.
Because the part of edit operations apparently depends on implementations of lattices and data structures for concepts, we expect that some advanced data structures help us to overcome this bottleneck. 5.2 Results and Discussions for GenSets We conducted the same experiments using the same datasets with GenSets (in Section 4.1) and our MDL evaluation (in Section 4.2). Results are summarized in Table 4. We can see that the numbers of selected concepts via GenSets (i.e., 12, 24, 62, and 40) are lower than those via itemsets (i.e., 13, 29, 66, and 72), which mean that obtained models via GenSets of databases are more compact than those via itemsets. The same trend can be confirmed in the Ratio column. The run-times of experiments have become faster than those in Table 3. Because the numbers of concepts (itemsets) and pattern concepts (GenSets) are the same now, these results should arise from the differences of the cover and the MDL evaluation. In addition the contents of the selected patterns should be affected. We can see the difference of the contents of the selection by checking the Jaccard index. As we can see that, our GenSets formulation surely generates completely different results compared with the results for itemsets. Edit Operations on Lattices for MDL-based Pattern Summarization 31 Table 4: Results by our lattice-based framework for GenSets. Gen-Edit |CF| |CT − | Ratio Pre-proc. [s] Run-time [s] Jaccard Index iris 163 12 0.428 0.009 0.550 0.785 breast 641 24 0.168 0.351 20.666 0.293 pima 3,204 62 0.342 18.92 285.676 0.123 wine 14,555 40 0.686 415.376 454.631 0.131 Furthermore, the existence of stars ? in GenSets affect the results. For GenSets (in Section 4.2), we aggressively cover items having longer codes in ST if the symbol ? appears. As less frequent itemsets have longer codes in the table ST and such items are covered by ? in advance, concepts stored in our model have relatively high frequency in the database, and the results get more compact. Concluding Remarks on Experiments, and Future Directions We con- ducted the MDL-based pattern summarization for selected benchmark datasets by using code-tables and our model (two tables). We also applied our model for GenSets to which we can apply the same framework. At least we confirmed that our model obtained more compact results using GenSets (checked by Ratio), and the contents are also different (done by Jaccard index). With respect to the quality of the choice of concepts via the two-part MDL principle it is always open to discuss. To discuss the quality, the previous work [18] developed the classifier using code-tables, tested it for labeled data, and discussed the quality of their choice. Therefore following their strategy and de- veloping a classifier using edit operations and lattices are our future work to consider the pros and cons of our model and the quality of the choice. Further discussion on the relation of run-times for edit-tables and properties of lattices (e.g., the density of contexts), is also important. Because the order of checking the set CF is optional, developing and analyzing further heuristic based on lattices is also important. We expect that several terminologies (not used in this paper) and methods for lattices would be helpful for this direction. 6 Conclusions and Future Work In this paper we propose a new framework for pattern summarization to con- quer the redundancy problem for various patterns by combining lattices and edit operations, which are evaluated by the two-part MDL principle. 
Our ex- periments show that our model successfully extends an existing model towards more general patterns. As an example we confirm that our model is applicable to GenSets, defined by pattern structures as a richer representation of itemsets. For closed itemsets, our model achieves the same results as the existing model (Proposition 1). Results of Ratio in experiments, GenSets are useful to obtain more compact representations of database compared with an existing model. 32 Keisuke Otaki and Akihiro Yamamoto In our future work, we plan to develop a classifier based on the MDL principle by following the existing work for discussing the quality of the selection. We also investigate encoding methods and heuristic for the cover towards various patterns together with side information. Theoretical analysis of the two-part MDL evaluation via lattices is also our important future research direction. Acknowledgements. The authors are grateful to the anonymous reviewers for valuable comments. This study was partially supported by Grant-in-Aid for JSPS Fellows (26-4555) and JSPS KAKENHI Grant Number 26280085. References 1. Aggarwal, C.C., Han, J. (eds.): Frequent pattern mining. Springer International Publishing (2014) 2. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proc. of the 20th VLDB. pp. 487–499 (1994) 3. Buzmakov, A., Egho, E., Jay, N., Kuznetsov, S.O., Napoli, A., Raı̈ssi, C.: The representation of sequential patterns and their projections within formal concept analysis. In: Workshop Notes for LML (ECML/PKDD2013) (2013) 4. Ganter, B., Kuznetsov, S.O.: Pattern structures and their projections. In: Proc. of the 9th ICCS. pp. 129–142 (2001) 5. Ganter, B., Wille, R.: Formal concept analysis - mathematical foundations. Springer (1999) 6. Geerts, F., Goethals, B., Mielikäinen, T.: Tiling databases. In: Proc. of the 7th DS. pp. 278–289 (2004) 7. Grünwald, P.D.: The minimum description length principle. The MIT Press (2007) 8. Koutra, D., Kang, U., Vreeken, J., Faloutsos, C.: VOG: Summarizing and under- standing large graphs. In: Proc. of the 22nd SDM. pp. 91–99 (2014) 9. Li, M., Vitnyi, P.M.: An introduction to kolmogorov complexity and its applica- tions. Springer Publishing Company, Incorporated, 3 edn. (2008) 10. Lichman, M.: UCI machine learning repository (2013), http://archive.ics.uci.edu/ml 11. Otaki, K., Ikeda, M., Yamamoto, A.: Pattern structures for understanding episode patterns. In: Proc. of the 11th CLA. pp. 47–58 (2014) 12. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed item- sets for association rules. In: Prof. of the 7th ICDT. pp. 398–416 (1999) 13. Pei, J., Han, J., Mao, R.: CLOSET: An efficient algorithm for mining frequent closed itemsets. In: Proc. of the DMKD2000. pp. 21–30 (2000) 14. Smets, K., Vreeken, J.: Slim: directly mining descriptive patterns. In: Proceedings of the 20th SDM. pp. 236–247 (2012) 15. Tatti, N., Vreeken, J.: Discovering descriptive tile trees. In: Proc. of the ECML/PKDD 2012. pp. 24–28 (2012) 16. Tatti, N., Vreeken, J.: The long and the short of it: Summarising event sequences with serial episodes. In: Proc. of the 18th KDD. pp. 462–470 (2012) 17. Uno, T., Asai, T., Uchida, Y., Arimura, H.: LCM: An efficient algorithm for enu- merating frequent closed item sets. In: Proc. of the FIMI’03 (2003) 18. Vreeken, J., van Leeuwen, M., Siebes, A.: Krimp: Mining itemsets that compress. 
Data Mining and Knowledge Discovery 23(1), 169–214 (2011) An Approach Towards Classifying and Navigating RDF data based on Pattern Structures Mehwish Alam and Amedeo Napoli LORIA (CNRS – Université de Lorraine) BP 239, 54506 Vandoeuvre-les-Nancy, France {mehwish.alam,amedeo.napoli@loria.fr} Abstract. With an increased interest in machine processable data, more and more data is now published in RDF (Resource Description Frame- work) format. This RDF data is present in independent and distributed resources which needs to be centralized, navigated and searched for do- main specific applications. This paper proposes a new approach based on Formal Concept Analysis (FCA) to create a navigation space over semantic web data. This approach uses an extension of FCA and takes RDF triples and RDF Schema present on several independent sources and provide centralized access over the data resulting from several re- sources. Afterwards, SPARQL queries can be posed over this navigation space to access these distributed resources from one platform for infor- mation retrieval purposes. Keywords: Formal Concept Analysis, Navigation Space, Linked Open Data, RDF, RDF Schema, Semantic Pattern Structures. 1 Introduction With the efforts of Semantic Web community many technologies have been of- fered for publishing machine-readable data on web. It annotates textual data with meta-data and makes it available in the form of ontologies and RDF graphs. One of the emerging source of such data are published in the form of Linked Open Data (LOD) cloud [2]. As a contrast to textual resources, LOD does not need extensive preprocessing as it is already annotated in the form of entities and relationships. This structured format leads to other kind of challenges. One of the ba- sic characteristics of Semantic Web is that it follows a decentralized publica- tion model, meaning that the RDF graphs are published in several distributed resources, instead of creating one knowledge-base of statements any one can contribute new statements and make it publicly available. These resources have nothing in common except some shared terms. These decentralized graphs should be integrated through machine/software agents to provide domain specific ap- plications. Moreover, external schemas in the form of ontologies or taxonomies c 2015 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Local Proceedings also appeared in ISBN 978-84-606-7410-8, Universidad de Málaga (Dept. Matemática Aplicada), Spain. 34 Mehwish Alam and Amedeo Napoli can be linked to these data to make sense based on real world conception. Some of the resources may only consist of ontological information and may not con- tain the instantial information such as SWRC ontology [11] and some resources may only contain instantial information such as DBLP. The problem of how to provide one platform for accessing several and related semantic web resources to build domain specific applications still persists. This paper focuses on how to link several related heterogeneous RDF resources present on distributed loca- tions containing RDF data and RDF Schema into one space which can further be navigated and searched by the end user. We call it a “navigation space” because it provides centralization over RDF triples and also over RDF Schema belonging to several resources. This space provides navigation and querying over RDF triples as well as RDF Schema. 
The current paper introduces a new framework based on Formal Concept Analysis [8] which focuses on how a navigation space is created over RDF data by taking into account an existing RDF Schema to provide search capabilities over distributed and heterogeneous Semantic Web resources. In the current study we only consider a part of RDF Schema which keeps subclass-superclass rela- tion. Other constructs such as sub-property and relation between classes are not considered here. FCA alone may not be able to handle the complex data such as RDF data and RDF Schema as it considers only binary data. To deal with this shortcoming, we employ an extension of FCA, namely Pattern Structures [7]. We propose a new framework called Semantic Pattern Structures, that takes RDF triples and the RDF Schema as an input and produces a “navigation space”. This navigation space provides centralization over distributed RDF resources and keeps a partially ordered organization of the classes of RDF triples with respect to RDF Schema from data sources which may or may not be part of LOD. The lattice structure allows for querying over RDF graph w.r.t. subject, predicate and object and makes use of the partial order relation between the concepts. This enables simultaneous search over schema and the facts contained in an RDF graph. This navigation space allows querying using well-documented and well-know query language, SPARQL. Several data sources are simultane- ously queried through one platform. To date this is the first attempt to deal with RDF graph and RDF Schema using pattern structures. The paper is structured as follows: section 2 gives the motivating example, section 3 introduces the basic notion of Pattern Structures with an example. Sec- tion 4 details the framework of semantic pattern structures for creating schema rich navigation space over RDF data. Section 5 defines the querying mechanism over the obtained navigation space. Section 6 describes the experimental results of the framework. Section 7 discusses the related work while Section 8 concludes the paper. 2 Motivating Example For instance, if a patient has been prescribed some Cardiovascular Agent (CVA) and after a while he is looking for a replacement for his drug because it causes Towards Semantic Indexing over RDF data based on Pattern Structures 35 some allergic condition as a side effect. For finding solution, the doctor should be able to access this information for obtaining the suggestions for replacing patient’s drug which is used for the same purpose but does not cause the side effect reported by the patient. All the required information is present in different data sources: – Drugbank1 contains information about drugs with their categories and the proteins it targets. – SIDER2 contains drug with their specific side effects. – The category of drugs CVA exists in MeSH 3 . – Allergic conditions which is a class of side effects exists in MedDRA4 . The information required by the domain expert cannot be obtained by posing a simple query on Google. SPARQL queries over each resource should be written to obtain this answer, which further needs manual work to obtain the required suggestions as the answers are in the form of list. In order to access such kind of information, all the data sets must be accessed simultaneously and there should be a single information space for accessing all the desired information through navigation and simple SPARQL queries. 
In the current study, we propose a generic framework that provides one space for accessing several data sources related to some domain. The framework pro- posed in this study is generic because it can be applied to any domain having related information scattered over several resources. Based on the requirements of the user and the applications, the navigation space can be defined on a limited number of factual RDF triples. For example, in case of a cardiologist, he will only be interested in Cardiovascular Agents or some related categories of drugs. This way the application will include the drugs which are Cardiovascular Agents and its related categories. Only a subset of datasets should be considered according to the requirements of the applications. These specification are described by the domain expert as a starting point for narrowing down the huge cloud of Linked Data present on-line. 3 Preliminaries 3.1 Linked Open Data Recently, Linked Open Data (LOD) [2] has become a standard for publishing data on-line in the form of Resource Description Framework (RDF)5 which can further be linked to other data sources published in the same format. The idea underlying RDF is based on storing statements about a particular re- source, where each statement is represented as xsubject, predicate, objecty. A 1 http://www.drugbank.ca/ 2 http://sideeffects.embl.de/ 3 http://www.ncbi.nlm.nih.gov/mesh/ 4 http://meddra.org/ 5 http://www.w3.org/RDF/ 36 Mehwish Alam and Amedeo Napoli set of linked statements is referred to as RDF graph which constitutes an RDF triple store. For instance, Table 2 keeps RDF triples for drugs with their side effects and drug categories. The prefixes and full forms of all the abbreviations used in this paper are shown in Table 1. Consider t1 i.e., xs1 , p1 , o1 y, here s1 is subject, p1 is predicate and o1 is the object. A URI U in an RDF graph is a general form of URL. The actual representation of the drug s1 in the form of URI is 1 http://wifo5-04.informatik.uni-mannheim.de/drugbank/ resource/drugs/DB006691 , which is a URI of the DrugBank provided by Uni- versity of Mannheim containing all the information about the drug in the form of triples. DB00669 is the id provided by DrugBank to drug s1 and the name of the drug is obtained by using the rdfs:label predicate defined in RDFS vocab- ulary6 . The subject denotes the resource and the predicate denotes properties of the resource and defines relationship between the subject and the object. In the rest of the paper we use dereferenced resources i.e., s1 instead of complete URI. Practically, we consider RDF data w.r.t. three levels according to the value of predicate. – An RDF triple is at the schema level when the predicate has a reserved value from RDFS vocabulary such as rdfs:subClassOf, rdfs:Class (rep- resented as sc and Class respectively in the rest of the paper). The keyword rdfs:Class is used to declare that a set of resources is constituting a class and rdfs:subClassOf states that all the instances of one class are the in- stances of another class. For example, in Table 2, the schema level is given by triples t6, t8. An example of the tree structure contained in an RDF schema (in the rest of the paper when we use RDF Schema/schema we mean the tree structure created by rdfs:subClassOf) related to the drug side effect and their categories is shown in Figures 1 and 2 respectively. – An RDF triple is at the type level if it keeps the mappings from an instance to its class and states that a resource is an instance of a class. 
The RDF triple is at this level when the predicate has a reserved value from RDF vocabulary such as rdf:type (represented as type in the rest of the paper). For example, in Table 2, the type level is given by triples t4, t5, t7. – An RDF triple is at factual level when the predicate is related to the domain knowledge and is not a keyword from RDF vocabulary. For example, in Table 2, the factual level is given by t1, t2, t3. Table 2 keeps RDF triples from several resources. The information about the side effect of the drugs shown in triples t1, t2 is present on SIDER7 database which keeps the side effects of the drugs contained on the packaging of the drug. The information about the categories of the drugs shown in triples t3 is present on DrugBank8 , which keeps detailed information about each of the drugs and the proteins it targets. The schema level information about the drug side effects shown in t5, t6 are present in The Medical Dictionary for Regulatory Activities 6 http://www.w3.org/TR/rdf-schema/ 7 http://sideeffects.embl.de/ 8 http://www.drugbank.ca/ Towards Semantic Indexing over RDF data based on Pattern Structures 37 Abbreviation Term Abbreviation Term p1 http://wifo5-04.informatik.uni-mannheim. p2 http://wifo5-04.informatik.uni-mannheim. de/sider/resource/sider/sideEffect de/drugbank/resource/drugbank/category s1 Sumatriptan C1 Acute Coronary Syndrome s2 Tadalafil C4 Setevens-Johnsons Disorders s3 Theophylline C2 Tachycardia s4 Ticlopidine C3 Urticaria s5 Vardenafil C5 Erythmia Multiforme D1 Fibrinolytic Agents C6 Coronary Artery Disorders D2 Vasoconstrictor Agents C7 Cardiac Arrhythmias D3 Vasodilator Agents C8 Urticarias D4 Fibrin Modulating Agents C9 Allergic conditions NEC D5 Cardiovascular Agents C10 Allergic conditions D6 Molecular Mechanisms of Pharmacological Action C11 Cardiac Disorders D7 Therapeutic Uses C12 Immune System Disorders Table 1: This table keeps the abbreviations of the terms used in the rest of the paper. tid Subject Predicate Object Provenance t1 s1 p1 o1 SIDER t2 s1 p1 C2 SIDER t3 s1 p2 D2 DrugBank t4 s1 type Drug DrguBank t5 o1 type C1 MedDRA t6 C1 sc C6 MedDRA t7 D2 type D5 MeSH t8 D5 sc D7 MeSH ... ... ... ... ... Table 2: RDF triples for the Drug Domain from several resources. (MedDRA) Terminology9 which is the international medical terminology. Finally the schema level information related to the drug category shown in triples t7, t8 is present in MeSH (Medical Subject Headings)10 is a controlled vocabulary thesaurus which along with other tree structures keeps classification of drug categories. 3.2 SPARQL A standard query language for accessing RDF graphs is SPARQL11 which mainly focuses on graph matching. A SPARQL query is composed of two parts the head and the body. The body of the query contains the Basic Graph Patterns (it is contained in the WHERE clause of the query). It is composed of complex graph patterns that may include RDF triples with variables, conjunctions, disjunc- tions and constraints over the values of the variables. These graph patterns are matched against the RDF graph and the matched graph is retrieved and manip- ulated according to the conditions given in the query. The head of the query is an expression which indicates how the answers of the query should be constructed. For instance, consider a simple SPARQL query SELECT ?x ?y WHERE { ?x p1 ?y}, then the basic graph pattern ?x p1 ?y will be matched against the triples containing p1 as predicate i.e., t1 and t2 (see Table 2). 
Thus the answer of the query will be (s1 , o1 ), (s1 , C2 ). 9 http://www.meddra.org/ 10 http://www.ncbi.nlm.nih.gov/mesh 11 http://www.w3.org/TR/rdf-sparql-query/ 38 Mehwish Alam and Amedeo Napoli 3.3 Pattern Structures Formal Concept Analysis [8] can process only binary context, more complex data such as graphs, RDF triples can not be directly processed by FCA. Some of the complex data require scaling for these to be considered as binary data. The concept lattice obtained by a binary context mixes between several types of attributes. Pattern structures [7], an extension of FCA, allows direct processing of such kind of complex data such as RDF triples (as we show in this paper). The pattern structures were introduced in [7]. A pattern structure is a triple pG, pD, [q, δq, where G is the set of entities12 , pD, [q is a meet-semilattice of descriptions D and δ : G Ñ D maps an entity to a description. More intuitively, a pattern structure is the set of objects with their descriptions with a similarity operation [ on them which represents the similarity of objects. This similarity measure is idempotent, commutative and associative. If pG, pD, [q, δq is the pattern structure then the derivation operators can be defined as: ę A˝ :“ δpgq f or A Ď G gPA d˝ :“ tg P G|d Ď δpgqu f or d P D Now the pattern concept can be defined as follows: Definition 1 (Pattern Concept). A pattern concept of a pattern structure “G, pD, [q, δ” is a pair pA, dq where A Ď G and d P D such that A˝ “ d and A “ d˝ , where A is called the concept extent and d is called the concept intent. Finally, the obtained concept lattice is called as pattern concept lattice, which keeps partially ordered relations between the pattern concepts. 4 Towards Semantic Pattern Structures The decentralization issue is raised when many entities are contributing state- ments and making it publicly available. Either they are publishing same data using different vocabularies or related data which is distributed over several resources. To provide centralized access over distributed resources this section discusses a framework called Semantic Pattern Structures. This new framework is called Semantic Pattern Structures (SPS) because it is based on Pattern Struc- tures and it deals with data in the form of triples and it classifies these triples with respect to RDF Schema present in the form of taxonomy. This way the final navigation space also keeps the semantic information i.e., background knowledge. The idea underlying the SPS is to create a concept lattice directly from an RDF graph and use the associated RDF Schema present on Linked Open Data to 12 We rename an object in Pattern Structures as an entity to avoid confusion with the object in an RDF triple. Towards Semantic Indexing over RDF data based on Pattern Structures 39 be accessed within one concept lattice. This concept lattice provides search and navigation capabilities over RDF data as well as over the RDF Schema to answer certain queries of the domain expert. The obtained lattice does not only provides access to several data sources but is also a materialization of the associated RDF Schema. For creating one navigation space over RDF as well as RDF Schema, the first task is to define RDF triples in terms of patterns structures i.e., specifying the entities, their descriptions and the mapping from entities to description. The second task is to select a suitable RDF Schema associated to these RDF triples from distributed data sources based on domain knowledge. 
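To illustrate the first of these two preprocessing tasks, the sketch below builds the entity-to-description mapping from a small fragment of the factual and type triples of Table 2; the helper name semantic_descriptions is ours, the triple fragment is deliberately incomplete, and so the resulting description is only a part of the one shown later in Table 3.

from collections import defaultdict

# A fragment of Table 2: factual triples and one rdf:type mapping.
factual = [("s1", "p1", "o1"), ("s1", "p1", "C2"), ("s1", "p2", "D2")]
rdf_type = {"o1": "C1"}                  # object -> its class, from a type triple

def semantic_descriptions(factual, rdf_type):
    """delta: subject -> {predicate: set of classes}; each object is replaced
    by its class when a type triple is available (Section 4.1)."""
    delta = defaultdict(lambda: defaultdict(set))
    for s, p, o in factual:
        delta[s][p].add(rdf_type.get(o, o))
    return {s: {p: frozenset(cs) for p, cs in d.items()} for s, d in delta.items()}

print(semantic_descriptions(factual, rdf_type))
# -> {'s1': {'p1': frozenset({'C1', 'C2'}), 'p2': frozenset({'D2'})}}

The similarity measure defined next then operates predicate-wise on these sets of classes.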
After these two preprocessing steps, we define a similarity measure [ over two descriptions which we generalize to the set of descriptions. After defining the similarity measure, we explain how a semantic pattern concept lattice is built using this similarity measure. Finally, we interpret, provide navigation and querying capabilities over the obtained pattern concept lattice. 4.1 From RDF Triples to Semantic Pattern Structures Firstly, we represent factual RDF triples about certain domain as entities and their descriptions. According to section 3.3, pattern structures are given as pG, pD, [q, δq. The descriptions D are termed as semantic descriptions and are denoted as Ds . An RDF triple store containing factual level information consists of statements of the form xsi , pj , ok y, where i “ t1, . . . , nu, j “ t1, . . . , qu, k “ t1, . . . , mu. Then, the set of subjects si are mapped to the entities in G. As the set of entities G represent the subjects in triples, we represent it as S. Regarding the type of each object set which is the range of same predicate a suitable RDF Schema is obtained by locating the terms in the closely re- lated RDF Schema. This selection is done based on domain knowledge. RDF Schema contains many constructs such as property, sub-property etc. along with sc and information but in this work we use the RDF Schema predicates such as type and sc introduced in section 3.1. First, the mapping from object to its type is obtained such as xok , type, C1 y i.e., ok is an instance of class C1 . Each object is then replaced by its corresponding class i.e., ok is replace by C1 . Now, the considered taxonomy information used for defining the similarity measured is present in the RDF Schema by following the predicates such as rdf s : subClassOf, skos : broader e.g., xC1 , sc, C6 y, i.e., C1 is subclass of C6 . Note that we do not extract the trees, we follows this predicate while defining the similarity in the next section. For instance, to obtain generic drug categories such as cardiovascular agents or generic side effects such as allergic conditions we further add the class infor- mation related to each of the objects in the triples. For example, (ok ,type,C2 ) meaning that ok is an instance of class C2 . Afterwards, we select all the super classes of the current class in the RDF Schema. For example, for C2 , we have C7 , C11 and C13 . An RDF Schema for drug side effect and drug category is shown in Figure 1 and Figure 2 respectively. 40 Mehwish Alam and Amedeo Napoli D9 C13 D8 C11 C12 D6 D7 C6 C7 C10 D4 D5 C1 C2 C8 C9 C3 C4 C5 D1 D2 D3 Fig. 1: RDF Schema for Side Effects (MedDRA) Fig. 2: RDF Schema for Drug Category (MeSH) pT1 q. pT2 q. The pair ppj : Ck q gives a description dij P Ds . The mapping of entities to description δ : S Ñ Ds is given as follows: let si P S then δpsi q “ tdi1 , . . . , diq u where i P t1, . . . , nu and dij “ tpj : tC1 , C4 , . . . , Cm uu where j P t1, . . . , qu. Finally, the semantic pattern structures are given as pS, pDs , [q, δq. Consider the running scenario described before. As the drugs in the DrugBank and SIDER are same and are the subjects then the triples t1, t2, t3 in Table 2 are represented as pS, pDs , [q, δq where the entity S has the description δps1 q = {(p1 : tC1 , C2 , C3 u), (p2 : tD2 u) }. The descriptions are based on same subjects having different descriptions belonging to different resources i.e., side effect and category. 
Finally, with respect to the triples in Table 2, the subjects in the RDF triples are mapped to the entities, S “ {s1 , s2 , . . . }, the descriptions are given as Ds = {(p1 : tC1 , C2 , C3 , . . . }) , (p2 : tD1 , D2 , . . . u)}. Entities S Descriptions Ds s1 {(p1 : tC1 , C2 , C3 u), (p2 : tD2 u)} s2 {(p1 : tC1 , C4 , C5 u), (p2 : tD3 u)} s3 {(p1 : tC2 u), (p2 : tD3 u)} s4 {(p1 : tC3 , C4 , C5 u), (p2 : tD8 u)} s5 {(p1 : tC1 , C3 u), (p2 : tD2 u)} Table 3: RDF Triples as entities S and semantic descriptions Ds . 4.2 Similarity Operation ([) over Descriptions For defining the similarity operation over the semantic description Ds , we use the notion of least common subsumer (lcs). Here we re-write the definition of a least common subsumer of two classes in a RDF Schema T given in [9]. Definition 2 (Least Common Subsumer). Given a partially ordered set pS, ďq, a least common subsumer E of two classes C and D (lcs(C,D) for short) in a partially ordered set is a class such that C Ď E and D Ď E and E is least i.e., if there is a class E 1 such that C Ď E 1 and D Ď E 1 then E Ď E 1 . Towards Semantic Indexing over RDF data based on Pattern Structures 41 Example 1. Let us consider C3 and C5 . Then according to the RDF Schema in Figure 1, the common subsumers of C3 and C5 are C10 , C12 , C13 and lcspC3 , C5 q “ C10 . Similarity Operation over Set of Descriptions: For describing the similar- ity over set of descriptions, LCS is computed pairwise for classes in two different sets of description and then only the most specific classes are picked. Let s1 and s2 be two entities such that s1 , s2 P S and the mapping δ is given as follows δps1 q “ d1 “ tp1 : tC1 , C2 , C3 uu and δps2 q “ d2 “ tp1 : tC1 , C4 , C5 uu. The similarity is always computed w.r.t the same predicate, here p1 . A dual similarity measure is defined to ensure classification of triples with respect to objects as well as with respect to classes from RDF Schema. The similarity measure is computed based on the taxonomy given in Figure 1. According to the current approach a pairwise similarity measure is computed between the two sets of classes connected through some predicate. For example, δps1 q [ δps2 q “ xp1 : tC1 , C2 , C3 uy [ xp1 : tC1 , C4 , C5 uy. Meaning that δps1 q [ δps2 q “ p1 : xlcspC1 , C1 q, lcspC1 , C5 q, . . . y. Then, δps1 q [ δps2 q “ p1 : tC1 , C10 , C11 , C13 u. As we have C1 ď C11 ď C13 and C10 ď C13 then we pick only the specific classes i.e., C1 and C10 . Finally, δps1 q [ δps2 q “ p1 : tC1 , C10 u. Now let us consider the complete descriptions of the two entities i.e., δps1 q “ {(p1 : tC1 , C2 , C3 u), (p2 : tD2 u)} and δps2 q “ {(p1 : tC1 , C4 , C5 u), (p2 : tD3 u)}. Then, δps1 q [ δps2 q “ {(p1 : tC1 , C2 , C3 u), (p2 : tD2 u)} [ {(p1 : tC1 , C4 , C5 u), (p2 : tD3 u)}. The set of least common subsumers for p1 will be C1 , C10 . The least common subumer between p2 : tD3 u and p2 : tD2 u is p2 : tD5 u. Finally, the similarity between these two drugs will be {(p1 : tC1 , C10 u), (p2 : tD5 u)}. The proposed similarity measure fulfills the property of a similarity operator i.e., commutative, associative and idempotent [7]. The number of RDF Schema considered are equal to number of predicates in the entity-description representa- tion of an RDF graph. In the running example, we use two schemas to show that our method can also be easily applied to multiple RDF Schemas which are either independent or parts of the same schema distributed over several resources. 
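The least common subsumer and the similarity over sets of classes can be prototyped on a small taxonomy. In the sketch below, the parent links are our reading of Figure 1 and should be treated as an assumption, and the helper names ancestors, lcs, and meet_classes are ours; under that reading it reproduces Example 1 and the similarity {C1, C10} computed above.

# Parent links reconstructed from Figure 1 (MedDRA fragment); this tree is an
# assumption used only to reproduce the examples of Section 4.2.
parent = {"C1": "C6", "C2": "C7", "C3": "C8", "C4": "C9", "C5": "C9",
          "C6": "C11", "C7": "C11", "C8": "C10", "C9": "C10",
          "C10": "C12", "C11": "C13", "C12": "C13"}

def ancestors(c):
    """c and all its superclasses, following rdfs:subClassOf upwards
    (ordered from most specific to most general)."""
    out = [c]
    while c in parent:
        c = parent[c]
        out.append(c)
    return out

def lcs(c, d):
    """Least common subsumer: the most specific class subsuming both c and d."""
    anc_d = set(ancestors(d))
    for a in ancestors(c):
        if a in anc_d:
            return a

def meet_classes(S1, S2):
    """Similarity of two sets of classes: pairwise lcs, then keep only the
    most specific classes of the result (Section 4.2)."""
    cands = {lcs(c, d) for c in S1 for d in S2}
    return {c for c in cands
            if not any(o != c and c in ancestors(o) for o in cands)}

print(lcs("C3", "C5"))                                        # -> C10 (Example 1)
print(meet_classes({"C1", "C2", "C3"}, {"C1", "C4", "C5"}))   # -> {'C1', 'C10'}

For the full meet of two semantic descriptions, this operation is applied per predicate, for instance also to the drug-category classes of Figure 2.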
4.3 Building the Semantic Pattern Concept Lattice For building the semantic pattern concept lattice using the above similarity measure over several RDF Schemas, according to the running example, δps1 q [ δps2 q “ tpp1 : tC1 , C10 uq, pp2 : tD5 uqu. The semantic pattern concept lattice is built according to definition 1. For instance, ts1 , s2 ul “ tpp1 : tC1 , C10 uq, pp2 : tD5 uqu but tpp1 : tC1 , C10 uq, pp2 : tD5 uqu l “ ts1 , s2 , s5 u. This way we obtain K#2 (see Figure 3) with 3 drugs giving the concept a support of 3. Finally, the pattern concept lattice is built for Table 3 with respect to T1 (Figure 1) and T2 (Figure 2). This pattern concept lattice is shown in Figure 3 and the details of each of the pattern concept is shown in Table 4. 42 Mehwish Alam and Amedeo Napoli K#8 K#4 K#9 K#5 K#6 K#2 K#10 K#11 K#13 K#7 K#3 K#1 K#12 K#0 Fig. 3: Pattern Concept lattice for Drugbank K#ID Extent Intent K#1 ts1 u (p1 : tC1 , C2 , C3 u), (p2 : tD2 u) K#2 ts1 , s2 , s5 u (p1 : tC1 , C10 u), (p2 : tD5 u) K#3 ts2 u (p1 : tC1 , C4 , C5 u), (p2 : tD3 u) K#4 ts1 , s2 , s3 , s5 u (p1 : tC11 u), (p2 : tD5 u) K#5 ts1 , s3 u (p1 : tC2 u), (p2 : tD5 u) K#6 ts2 , s3 u (p1 : tC11 u), (p2 : tD3 u) K#7 ts3 u (p1 : tC2 u), (p2 : tD3 u) K#8 ts1 , s2 , s3 , s4 , s5 u (p1 : tC13 u), (p2 : tD8 u) K#9 ts1 , s2 , s4 , s5 u (p1 : tC10 u), (p2 : tD8 u) K#10 ts1 , s4 , s5 u (p1 : tC3 u), (p2 : tD8 u) K#11 ts2 , s4 u (p1 : tC4 , C5 u), (p2 : tD8 u) K#12 ts4 u (p1 : tC3 , C4 , C5 u), (p2 : tD9 u) K#13 ts1 , s5 u (p1 : tC1 , C3 u), (p2 : tD2 u) Table 4: Details of Pattern Concept lattice in Figure 3 4.4 Interpreting and Navigating the Pattern Concept Lattice This pattern concept lattice not only serves as a navigation space but also pro- vides clustering of RDF triples w.r.t. some Schema. Each of the pattern concepts is a cluster of triples. For instance, consider the simplest example in Table 4 i.e., K#1, where there is only one entity in the extent, then this cluster contains the triples tps1 , p1 , C1 q, ps1 , p2 , D2 q, . . . u. Here it can be seen that all the classes are connected to single subject S through two predicates. This space can further be navigated to provide the required information to the user without the techni- cal background of Semantic Web. Let us consider, if user is looking for drugs causing side effect C10 , the located pattern concept will be K#9 which contains all the drugs causing some allergic conditions. Afterwards, if the user is looking for some very specific allergic condition such as C3 then he can simply navigate downwards to obtain more specific concept i.e., K#10 which contains all the Towards Semantic Indexing over RDF data based on Pattern Structures 43 drugs which cause the side effect C3 . Let us consider the scenario described in section 2, here the user will be looking for cardiovascular agents (D5 ) which do not cause any allergic conditions. In this case first the concept containing all the cardiovascular agents are located i.e., K#4 afterwards the lattice is further navigated downwards to obtain the drugs causing C10 (allergic conditions) i.e., K#2. Now the drugs which are in K#4 and not in K#2 are the CVAs which do not cause allergic conditions. 5 SPARQL Queries over Pattern Concept Lattice For the user with the basic knowledge of SPARQL queries, this navigation space can be easily accessed by posing simpler queries. This sub-lattice gives additional information related to the query (detailed below). 
The obtained pattern concept lattice keeps the materialized RDF Schema meaning that it does not require the query to perform reasoning. It also allows for accessing several resources from one platform. This navigation space allows queries which are subject, predicate or object bound. In this section we describe simple as well as complex SPARQL queries that can be posed over the this navigation space. Simple queries refer to the queries with simple Basic Graph Patterns such a single triple pattern in the WHERE clause of the SPARQL query, on the other hand, complex queries allow joins and negations. 5.1 Simple Queries Simple queries include subject, predicate or object bound queries. Subject-bound queries refer to the queries where the value of the subject in an RDF triple xs, p, oy is known while the values of predicate, object and classes of objects are to be retrieved. Let a SPARQL query be SELECT ?p ?o WHERE {si ?p ?o}, where si is a known subject and a pattern concept be pA, dq such that si P A. The BGP in this query is (si ?p ?o), this BGP is matched against the RDF graph present in the pattern concept lattice. The answer of the query will be the object concept tsi ul “ tdi u where di P d where d is the intent of the concept pA, dq. Note that this object concept is already computed while building a pattern concept lattice. In other words, all the predicate-object pairs tdi u connected to the subject si in the RDF graph G are returned as a result. For example, if user wants to know complete information about the drug s1 i.e., its side effect and the category, then SELECT ?p ?o WHERE {s1 ?p ?o}. It will search for object concept in the pattern concept lattice given in Figure 3. For instance, sl 1 “ pp1 : tC1 , C2 , C3 u, p2 : tD2 uq (K#1) and the answer would be pp1 : tC1 , C2 , C3 u, p2 : tD2 uq. The first part of the answer pp1 : tC1 , C2 , C3 u belongs to SIDER while p2 : tD2 u belongs to DrugBank. However, because of the strong lattice-structure a sub-lattice is retrieved by navigating to more general concepts (upward navigation) present in the pattern concept lattice i.e., all the concepts connected to K#1 i.e., K#2, K#4, K#5, K#9, K#10, K#13, K#8. Here, K#1 is the focus concept because it keeps the exact answer to the posed 44 Mehwish Alam and Amedeo Napoli SPARQL query and the rest of the concepts contain additional information. This way the user not only gets the desired answer but also some additional information about the drug i.e., the classification of side effects from MedDRA and the features it shares with other drugs and with which drugs it shares certain features. Object-bound queries are answered in the same way as described above by retrieving the attribute concept and the linked concepts. However, in this case the additional information will be retrieved by navigating downwards from the focus concept because it is the object-bound query and objects are contained in the intent of the pattern concepts. 5.2 Queries with Joins Now let us consider a slightly more complex query which includes a join and a class resources to be retrieved. If it is specified that the drugs to be re- trieved should be Cardiovascular Agent causing some Allergic Condition. Then the SPARQL query with subject-subject join will be as follows: SELECT ?drug WHERE { ?drug p2 D5 . 
?drug p1 C10 } For obtaining the answer, the BGP in the above query is matched against the concept which contains both the predicates p1 and p2 , along with this pred- icate it keeps the classes of objects CardiovascularAgents (D5 ) connected to predicate p2 as well as AllergicConditions (C10 ) connected to the predicate p1 . The answer will be ts1 , s2 , s5 u i.e., K#2, which will be the focus concept. However, to obtain more specific information regarding these classes of drug categories and the side effects will be returned by navigating downwards in the concept lattice. Finally, a sub-lattice containing K#2, K#3, K#13, K#1 is returned as an answer to the above query. In this case, the user will have an ad- ditional information regarding more specific categories related to Cardiovascular Agents causing Allergic Conditions such as according to K#13 drugs ts1 , s5 u are D2 and K#3 ts2 u is D3 . Similarly, specific allergic conditions are also retrieved. 5.3 Queries with Filtering Criteria Now let us move towards the scenario where the doctor is looking for drug re- placement for the patient having heart conditions based on its side effects. For replacing the drugs, the system can be queried by locating all the CVAs (this information comes from DrugBank, MeSH in the current navigation space) and then excluding all the CVAs causing Allergic Conditions. The related SPARQL query can be given as follows: SELECT ?drug WHERE { ?drug p2 D5 . Towards Semantic Indexing over RDF data based on Pattern Structures 45 ?drug p1 ?se FILTER (?se != C10 ) } The condition is given in the filter clause because SPARQL does not support negation operation directly. In order to do so, all the drugs which are CVAs are retrieved by SPARQL and then it filters the drugs which do not cause allergic conditions. Same is the case with the lattice operation where concept K#4 is selected because it contains all the cardiovascular agents and then a more spe- cific concept is obtained containing the CVAs causing allergic conditions i.e., K#2 and minus operator is applied to perform the desired filtering. The answer obtained is s3 . In a sense, the navigation space follows the query mechanism provided by Google where a query is posed by the user and this query is mapped to a pre- existing cluster of documents. In the same way, when a query is posed by the user, our query mechanism maps the query to a cluster of RDF triples, where each element in an RDF triple keeps the URI of a web-page. 6 Experimentation The current approach was applied to biomedical data. This section details the experimental evaluation for the creation of the navigation space The proposed algorithm was coded in C++ and the experiments was performed using 3GB RAM on Ubuntu version 12.04. 6.1 Drug Search The datasets containing information about drugs are DrugBank, SIDER, Med- DRA and MeSH as described in section 3.1. DrugBank keeps detailed informa- tion about each of the drugs, their categories and the proteins it targets. The other database is SIDER which keeps the side effects of the drugs contained on the packaging of the drug. The access to these two data sets is provided by University of Mannheim through two different endpoints1314 . The schema associated to the side effects of the drug is available on BioPor- tal [12], which is a web portal that provides access to a repository of biomedical ontologies. BioPortal is developed by National Center for Biomedical Ontology (NCBO) [10]. 
13 http://wifo5-04.informatik.uni-mannheim.de/drugbank/snorql/
14 http://wifo5-04.informatik.uni-mannheim.de/sider/snorql/
The Medical Dictionary for Regulatory Activities (MedDRA) is the international medical terminology. In this experiment, the side effects are enriched with schema-level information from the MedDRA terminology. For the drug categories, the MeSH (Medical Subject Headings) controlled vocabulary thesaurus is taken into account: the drug categories from DrugBank are enriched with tree numbers from MeSH, which arrange the MeSH terms hierarchically into the so-called MeSH Tree Structures. In the current experiment we used the MeSH vocabulary already available as RDF in Bio2RDF [1, 5], which publishes public bioinformatics databases such as KEGG, MeSH and HGNC in RDF format. Two subsets of the dataset were considered, corresponding to two major classes of drugs: Cardiovascular Agents (CVA) and Central Nervous System (CNS) agents.
6.2 Evaluation
In the following, we study the scalability of Semantic Pattern Structures over large datasets. Table 5 gives the statistics of the data. The pattern concept lattices over both chosen data sources were built in at most 25 seconds, for a maximum of 31098 triples. Figure 4b shows the variation in the size of the navigation space for both data sets. The navigation space contains a maximum of around 500000 clusters of triples, which were generated in 25 seconds. However, there are several ways to reduce this number of concepts. Filtering on the depth of the classes considered in the taxonomy reduces the number of clusters while the concept lattice is being generated and hence decreases the runtime of creating the navigation space; most of these very general classes are not very interesting for the domain expert anyway. Moreover, several measures such as support, stability and lift allow post-filtering of the navigation space.

Datasets                 No. of Triples  No. of Subjects  No. of Objects  Runtime
Cardiovascular Agents    31098           145              927             0-22 sec
Central Nervous System   22680           105              1050            0-25 sec
Table 5: Statistics of the two datasets and of the navigation space.

Fig. 4: Size of the Navigation Space for each dataset.
7 Related work
In [6], the author focuses on conceptual navigation over RDF graphs, where each concept is accessed through SPARQL-like queries. In our case, however, several RDF graphs are considered and we use the existing, well-established and well-documented query language SPARQL. Moreover, [3] introduces ontological pattern structures for enriching raw data with EL ontologies. Both approaches consider only one resource at a time, hence they do not target the problem of decentralization, whereas our approach provides a navigation space over RDF graphs together with schema-level information from several resources, allowing the user to access information from one platform. In [4], the authors introduce a new clause Categorize By which clusters SPARQL query answers w.r.t. background knowledge given as ontologies. Like our approach, only the taxonomy is used for enrichment; however, unlike our approach, it clusters the answers only after the query answers have been obtained.
As a contrast, our approach provides classification of the RDF triples as well as RDF Schema before hand. After- wards, SPARQL queries are mapped to the existing clusters and the answer is shown to the user. In such a case, the query response time is faster than the Categorize By clause. Moreover, as it provides clustering over answers only, it lacks the capability to provide user with additional information. 8 Conclusion and Future Work This paper proposes a new approach for navigating semantic web data and tar- gets the capabilities of FCA to deal with RDF data. It provides navigational and search capabilities over RDF triples as well as RDF Schema distributed over several resources. This paper proposes a new similarity measure for pattern structures to deal with RDF data as well as RDF Schema simultaneously, termed as Semantic Pattern Structures. The pattern concepts in the concept lattice are considered as clusters of of RDF triples which can then be navigated and queried by the user. References 1. François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Jean Morissette. Bio2rdf: Towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 41(5):706–716, 2008. 2. Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data - the story so far. Int. J. Semantic Web Inf. Syst., 5(3):1–22, 2009. 3. Adrien Coulet, Florent Domenach, Mehdi Kaytoue, and Amedeo Napoli. Using pattern structures for analyzing ontology-based annotations of biomedical data. In 11th International Conference on Formal Concept Analysis,, 2013. 4. Claudia d’Amato, Nicola Fanizzi, and Agnieszka Lawrynowicz. Categorize by: Deductive aggregation of semantic web query results. In Lora Aroyo, Grigoris An- toniou, Eero Hyvönen, Annette ten Teije, Heiner Stuckenschmidt, Liliana Cabral, and Tania Tudorache, editors, ESWC (1), volume 6088 of Lecture Notes in Com- puter Science, pages 91–105. Springer, 2010. 48 Mehwish Alam and Amedeo Napoli 5. Michel Dumontier, Alison Callahan, Jose Cruz-Toledo, Peter Ansell, Vincent Emonet, François Belleau, and Arnaud Droit. Bio2rdf release 3: A larger, more connected network of linked data for the life sciences. In Posters & Demonstrations Track ISWC., 2014. 6. Sébastien Ferré. Conceptual navigation in RDF graphs with sparql-like queries. In Formal Concept Analysis, 8th International Conference, ICFCA 2010, Agadir, Morocco, March 15-18, 2010. Proceedings, pages 193–208, 2010. 7. Bernhard Ganter and Sergei O. Kuznetsov. Pattern structures and their projec- tions. In Harry S. Delugach and Gerd Stumme, editors, ICCS, volume 2120 of Lecture Notes in Computer Science, pages 129–142. Springer, 2001. 8. Bernhard Ganter and Rudolf Wille. Formal Concept Analysis: Mathematical Foun- dations. Springer, Berlin/Heidelberg, 1999. 9. Ralf Küsters and Ralf Molitor. Computing least common subsumers in ALEN. In Proceedings of the Seventeenth International Joint Conference on Artificial Intel- ligence, IJCAI, pages 219–224, 2001. 10. Mark A. Musen, Natalya Fridman Noy, Nigam H. Shah, Patricia L. Whetzel, Christopher G. Chute, Margaret-Anne D. Storey, and Barry Smith. The national center for biomedical ontology. JAMIA, 19(2):190–195, 2012. 11. York Sure, Stephan Bloehdorn, Peter Haase, Jens Hartmann, and Daniel Oberle. The swrc ontology - semantic web for research communities. In EPIA, volume 3808 of Lecture Notes in Computer Science, pages 218–231. Springer, 2005. 12. Patricia L. Whetzel, Natalya Fridman Noy, Nigam H. Shah, Paul R. 
Alexander, Csongor Nyulas, Tania Tudorache, and Mark A. Musen. Bioportal: enhanced func- tionality via new web services from the national center for biomedical ontology to access and use ontologies in software applications. Nucleic Acids Research, 39(Web- Server-Issue):541–545, 2011. Subdirect Decomposition of Contexts into Subdirectly Irreducible Factors Jean-François Viaud1 , Karell Bertet1 , Christophe Demko1 , Rokia Missaoui2 1 Laboratory L3i, University of La Rochelle, France {jviaud, kbertet, cdemko}@univ-lr.fr 2 University of Québec in Outaouais, Canada rokia.missaoui@uqo.ca Abstract. The size of a concept lattice may increase exponentially with the size of the context. When the number of nodes is too large, it becomes very difficult to generate and study such a concept lattice. A way to avoid this problem is to break down the lattice into small parts. In the subdirect decomposition, the small parts are factor lattices which are meaningful in the Formal Concept Analysis (FCA) setting. In this paper a walkthrough from a finite reduced context to its subdi- rect decomposition into subdirectly irreducible subcontexts and factors is given. The decomposition can be reached using three different points of view, namely factor lattices, arrow relations and compatible subcon- texts. The approach is mainly algebraic since it is based on abstract lattice theory, except for the last point which is inherited from FCA. We also propose a polynomial algorithm to generate the decomposition of an initial context into subcontexts. Such a procedure can be extended to conduct an interactive exploration and mining of large contexts, includ- ing the generation of few concepts and their neighborhood. Keywords: concept lattice, congruence relation, factor lattice, arrow relation, arrow closed subcontext, compatible subcontext 1 Introduction During the last decade, the computation capabilities have promoted Formal Concept Analysis (FCA) with new methods based on concept lattices. Though they are exponential in space/time in worst case, concept lattices of a reason- able size enable an intuitive representation of data organized by a context that links objects to attributes through a binary relation. Methods based on concept lattices have been developed in various domains such as knowledge discovery and representation, database management and information retrieval where some relevant concepts, i.e. possible correspondences between objects and attributes are considered either as classifiers, clusters or representative object/attribute subsets. With the increasing size of data, a set of methods have been proposed in order to either generate a subset (rather than the whole set) of concepts and c 2015 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Local Proceedings also appeared in ISBN 978-84-606-7410-8, Universidad de Málaga (Dept. Matemática Aplicada), Spain. 50 Jean-François Viaud, Karell Bertet, Christophe Demko, and Rokia Missaoui their neighborhood (e.g. successors and predecessors) in an online and interac- tive way [8, 20] or better display lattices using nested line diagrams [13]. Such approaches become inefficient when contexts are huge. However, the main idea of lattice/context decomposition into smaller ones is still relevant when the classifi- cation property of the initial lattice is maintained. 
Many lattice decompositions have been defined and studied, both from algebraic [6, 16] and FCA points of view [13, 12]. We can cite Unique Factorisation Theorem [16], matrix decompo- sition [2], Atlas decomposition [13], subtensorial decomposition [13], doubling convex construction [5, 17, 14, 3] and subdirect decomposition. The latter has been widely studied many years ago, in the field of universal algebra [6, 11], and even in FCA [21–24, 12]. To the best of our knowledge, there is no new development or novel algorithms for subdirect decomposition of contexts. In this paper we investigate the subdirect decomposition of a concept lattice as a first step towards an interactive exploration and mining of large contexts. The subdirect decomposition of a lattice L into factor lattices (Li )i∈{1,...,n} , de- noted by L ,→ L1 ×· · ·×Ln , is defined by two properties (see important results in [13]): (i) L is a sublattice of the direct product L1 ×· · ·×Ln , and (ii) each projec- tion of L onto a factor is surjective. The first property establishes that each factor lattice is the concept lattice of an arrow-closed subcontext, i.e. closed accord- ing to the arrow relation between objects and attributes. This means that the decomposition can be obtained by computing specific subcontexts. The second property states that there is an equivalence between arrow-closed subcontexts and congruence relations of L, i.e., an equivalence relation whose equivalence classes form a lattice with elements closed by the meet and join operations. This means that the concepts of L can be retrieved from the factor lattices, and the classification property of L is maintained since each equivalence relation forms a partition of the elements. The last result establishes an equivalence between arrow-closed subcontexts and compatible subcontexts, i.e. subcontexts such that each concept corresponds to a concept of the initial lattice. This result gives a way to compute the morphism from L into the direct product, and thus to re- trieve the concepts of L from the factor lattices. In this paper, we deduce from these results strong links between the following notions that have not been used yet together as far as we know: – Factors of a subdirect decomposition, – Congruence relations, – Arrow-closed subcontexts and – Compatible subcontexts. As suggested in [13], the contexts of the factors of a particular subdirect decomposition, namely the subdirectly irreducible subcontexts, can be obtained by a polynomial processing of each row/object of the initial context. Therefore, the subdirect decomposition of a lattice can be extended to a subdirect decom- position of its reduced context into subdirect and irreducible subcontexts. In this paper, we propose a subdirect and polynomial decomposition of a context into subcontexts by extending the subdirect decomposition of a lattice Subdirect Decomposition of Contexts into Subdirectly Irreducible Factors 51 into factor lattices. This decomposition leads to data storage saving of large contexts. Indeed, the generation of the whole set of factor lattices can be avoided by providing an interactive generation of a few (but not all) concepts and their neighborhood from large contexts. Moreover, a focus on a specific factor lattice can be proposed to the user by generating, partially or entirely, the concept lattice and/or a basis of implications. There are at least two reasons for studying this case of pattern manage- ment. 
The first one comes from the fact that users tend to be overwhelmed by the knowledge extracted from data, even when the input is relatively small. The second reason is that the community of FCA has made progress in lattice construction and exploration, and hence existing solutions can be adapted and enriched to only target useful patterns (i.e. pieces of knowledge). This paper is organized as follows. Section 2 introduces the subdirect de- composition and the three different points of view, namely factor lattices, arrow relations and compatible subcontexts. Section 3 contains the main contribution of this paper about the subdirect decomposition and the proposed algorithms. A preliminary empirical study is presented in Section 4 while Section 5 presents future work. 2 Structural framework Throughout this paper all sets (and thus lattices) are considered to be finite. 2.1 Lattices and Formal Concept Analysis Algebraic lattice Let us first recall that a lattice (L, ≤) is an ordered set in which every pair (x, y) of elements has a least upper bound, called join x ∨ y, and a greatest lower bound, called meet x ∧ y. As we are only considering finite structures, every subset A ⊂ L has a join and meet (e. g. finite lattices are complete). Concept or Galois Lattice A (formal) context (O, A, R) is defined by a set O of objects, a set A of attributes, and a binary relation R ⊂ O × A, between O and A. Two operators are derived: – for each subset X ⊂ O, we define X 0 = {m ∈ A, j R m ∀j ∈ X} and dually, – for each subset Y ⊂ A, we define Y 0 = {j ∈ O, j R m ∀m ∈ Y }. A (formal) concept represents a maximal objects-attributes correspondence by a pair (X, Y ) such that X 0 = Y and Y 0 = X. The sets X and Y are respec- tively called extent and intent of the concept. The set of concepts derived from a context is ordered as follows: (X1 , Y1 ) ≤ (X2 , Y2 ) ⇐⇒ X1 ⊆ X2 ⇐⇒ Y2 ⊆ Y1 The whole set of formal concepts together with this order relation form a complete lattice, called the concept lattice of the context (O, A, R). 52 Jean-François Viaud, Karell Bertet, Christophe Demko, and Rokia Missaoui Different formal contexts can provide isomorphic concept lattices, and there exists a unique one, named the reduced context, defined by the two sets O and A of the smallest size. This particular context is introduced by means of special concepts or elements of the lattice L, namely irreducible elements. An element j ∈ L is join-irreducible if it is not a least upper bound of a subset not containing it. The set of join irreducible elements is noted JL . Meet-irreducible elements are defined dually and their set is ML . As a direct consequence, an element j ∈ L is join-irreducible if and only if it has only one immediate predecessor denoted j − . Dually, an element m ∈ L is meet-irreducible if and only if it has only one immediate successor denoted m+ . In Figure 1, join-irreducible nodes are labelled with a number and meet-ir- reducible nodes are labelled with a letter. Fig. 1. A lattice with its irreducible nodes Fundamental Bijection A fundamental result [1] establishes that any lattice (L, ≤) is isomorphic to the concept lattice of the context (JL , ML , ≤), where JL and ML are the join and meet irreducible concepts of L, respectively. Moreover, this context is a reduced one. As a direct consequence, there is a bijection between lattices and reduced con- texts where objects of the context are associated with join-irreducible concepts of the lattice, and attributes are associated with meet-irreducible concepts. 
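To make the two derivation operators and the notion of formal concept concrete, here is a minimal Python sketch; the toy context and all identifiers are ours, and the brute-force enumeration is only meant for very small contexts (practical implementations use algorithms such as NextClosure).

from itertools import combinations

# Toy context: object -> set of attributes (ours, not from the paper).
context = {
    "o1": {"a", "b"},
    "o2": {"b", "c"},
    "o3": {"a", "b", "c"},
}
objects = set(context)
attributes = set().union(*context.values())

def prime_objects(X):
    """X' : attributes shared by every object of X."""
    return set.intersection(*(context[g] for g in X)) if X else set(attributes)

def prime_attributes(Y):
    """Y' : objects whose description contains every attribute of Y."""
    return {g for g in objects if Y <= context[g]}

def concepts():
    """All formal concepts, obtained by closing each subset of objects."""
    found = set()
    for r in range(len(objects) + 1):
        for X in combinations(sorted(objects), r):
            intent = prime_objects(set(X))
            extent = prime_attributes(intent)
            found.add((frozenset(extent), frozenset(intent)))
    return found

for extent, intent in sorted(concepts(), key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))

On such a small example the object and attribute concepts, and hence the reduced context of the fundamental bijection, can be read off directly from the output.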
Figure 2 shows the reduced context of the lattice in Figure 1. Subdirect Decomposition of Contexts into Subdirectly Irreducible Factors 53 bcd f g j 2xxxxx 3x x xx 5xxx 6xx x 9 x x Fig. 2. The reduced context of the lattice in Figure 1 2.2 Compatible and Arrow-closed Subcontexts This section is dedicated to the equivalence between compatible and arrow-closed subcontexts. Compatible subcontexts A subcontext of a formal context (O, A, R) is a triple (J, M, R ∩ J × M ) such that J ⊂ O and M ⊂ A. A subcontext (J, M, R ∩ J × M ) of (O, A, R) is compatible if for each concept (H, N ) of (O, A, R), (J ∩ H, M ∩ N ) is a concept of (J, M, R ∩ J × M ). Arrow relations The arrow-closed subcontexts involved in the equivalence are based on the arrow relations between join and meet irreducible concepts of a lattice. Consider the reduced context (JL , ML , ≤) of a lattice (L, ≤). Arrow relations [4, 15] form a partition of the relation 6≤ (defined by not having x ≤ y) by considering the immediate predecessor j − of a join-irreducible j, and the unique immediate successor m+ of a meet-irreducible m: – j l m if j 6≤ m, j ≤ m+ and j − ≤ m. – j ↑ m if j 6≤ m, j ≤ m+ and j − 6≤ m. – j ↓ m if j 6≤ m, j 6≤ m+ and j − ≤ m. – j ◦ m if j 6≤ m, j 6≤ m+ and j − 6≤ m. In Figure 3, the reduced context of Figure 2 is enriched with the four relations l, ↑, ↓, and ◦ in the empty cells that both correspond to the case where j 6≤ m: b c d f g j 2××××× l 3× l × ↓ ×× 5××× l l ◦ 6×× l × ↓ ◦ 9 l × ◦ × ◦ ◦ Fig. 3. Arrow relation 54 Jean-François Viaud, Karell Bertet, Christophe Demko, and Rokia Missaoui As an illustration, let j = 5 and m = f be join-irreducible and meet- irreducible nodes respectively (see Figure 1). Then, j − = 2 and m+ = c. The relation 5 l f holds since 5 6≤ f , 5 ≤ c and 2 ≤ f . Arrow-closed subcontext A subcontext (J, M, R ∩ J × M ) of a context (O, A, R) is an arrow-closed subcontext when the following conditions are met: – If j ↑ m and j ∈ J then m ∈ M – If j ↓ m and m ∈ M then j ∈ J As an example, the first subcontext of Figure 4 is an arrow-closed subcontext of the reduced context of Figure 3 whereas the second one is not, due to the down-arrow 6 ↓ g. cd f g cdfg 3 x x 3 x x 5xx 5xx 6x x Fig. 4. Arrow-closed and non-arrow-closed subcontexts of the context in Figure 3 Equivalence theorem First let us introduce the first equivalence we need in this paper, whose proof can be found in [13]: Theorem 1. Let (J, M, R ∩ J × M ) be a subcontext of (O, A, R). The following propositions are equivalent: – The subcontext (J, M, R ∩ J × M ) is a compatible one. – The subcontext (J, M, R ∩ J × M ) is an arrow-closed one. 2.3 Congruence Relations and Factor Lattices In this section, we introduce the equivalence between congruence relations and arrow-closed subcontexts. Quotient An equivalence relation is a binary relation R over a set E which is reflexive, symmetric, and transitive. An equivalence class of x ∈ E is: xR = {y ∈ E | xRy} The set of equivalence classes, called the quotient set E/R, is: E/R = {xR | x ∈ E} Subdirect Decomposition of Contexts into Subdirectly Irreducible Factors 55 Factor lattice A congruence relation Θ on a lattice L is an equivalence relation such that: x1 Θy1 and x2 Θy2 =⇒ x1 ∧ x2 Θy1 ∧ y2 and x1 ∨ x2 Θy1 ∨ y2 The quotient L/Θ verifies the following statement: xΘ ≤ yΘ ⇐⇒ xΘ(x ∧ y) ⇐⇒ (x ∨ y)Θy With such an order, L/Θ is a lattice, called factor lattice. A standard theorem from algebra, whose proof is omitted, states that: Theorem 2. The projection L → L/Θ is a lattice morphism onto. 
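The arrow relations of Subsec. 2.2 can also be computed directly from the context, using the standard characterisation in terms of maximal object intents and attribute extents, which for the reduced context of a finite lattice coincides with the definition through j− and m+ given above. The following Python sketch illustrates this; the toy context is our own reading of the cross-table of Fig. 2, and all function names are ours.

# Arrow relations from a context (object -> set of attributes).
# down_cond(g, m): every strictly larger object intent contains m   (j- <= m)
# up_cond(g, m):   every strictly larger attribute extent contains g (j <= m+)
context = {
    "2": {"b", "c", "d", "f", "g"},
    "3": {"b", "d", "g", "j"},
    "5": {"b", "c", "d"},
    "6": {"b", "c", "f"},
    "9": {"c", "f"},
}
objects = sorted(context)
attributes = sorted(set().union(*context.values()))

def extent(m):
    return {g for g in objects if m in context[g]}

def down_cond(g, m):   # "down" side of the definitions: j- <= m
    return m not in context[g] and all(
        m in context[h] for h in objects if context[g] < context[h])

def up_cond(g, m):     # "up" side of the definitions: j <= m+
    return m not in context[g] and all(
        g in extent(n) for n in attributes if extent(m) < extent(n))

names = {(True, True): "double arrow", (True, False): "down arrow",
         (False, True): "up arrow", (False, False): "circle"}
for g in objects:
    for m in attributes:
        if m not in context[g]:
            print(g, m, names[(down_cond(g, m), up_cond(g, m))])

With this toy context the output matches the arrows of Fig. 3: double arrows 2-j, 3-c, 5-f, 5-g, 6-d and 9-b, down-arrows 3-f and 6-g, and circles everywhere else.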
The second equivalence theorem We are now able to formulate the second equivalence whose proof can be found in [13]: Theorem 3. Given a lattice L, the set of congruence relations on L corresponds bijectively with the set of arrow-closed subcontexts of the reduced context of L. Congruence relations will be computed with this theorem. However, other algorithms exist [9, 10]. 2.4 Subdirect decompositions In this section, we introduce the equivalence between subdirect decompositions and sets of arrow-closed subcontexts. Subdirect product Definition 1. A subdirect product is a sublattice of a direct product L1 ×· · ·×Ln of lattices Li , i ∈ {1, . . . , n} such that each projection onto a factor is surjective. The lattices Li , i ∈ {1, . . . , n} are the factor lattices. A subdirect decomposition of a lattice L is an isomorphism between L and a subdirect product which can be denoted as: L ,→ L1 × · · · × Ln Li The third equivalence theorem The third and most important equivalence whose proof can be found in [13], makes a connection with sets of arrows-closed subcontexts when they cover the initial context: Proposition 1. Given a reduced context (O, A, R), then the subdirect decom- positions of its concept lattice L correspond bijectively to the families of arrow- closed subcontexts (Jj , Mj , R ∩ Jj × Mj ) with O = ∪Jj and A = ∪Mj . 56 Jean-François Viaud, Karell Bertet, Christophe Demko, and Rokia Missaoui 3 Our contribution 3.1 Main Result From the three previous equivalences found in [13], we deduce the following one: Corollary 1. Given a lattice L and its reduced context (O, A, R), we have an equivalence between: 1. The set of arrow-closed subcontexts of (O, A, R), 2. The set of compatible subcontexts of (O, A, R), 3. The set of congruence relations of L and their factor lattices. Corollary 2. Given a lattice L and its reduced context (O, A, R), we have an equivalence between: 1. The families of arrow-closed subcontexts of (O, A, R) covering O and A, 2. The families of compatible subcontexts of (O, A, R) covering O and A, 3. The families (θi )i∈I of congruence relations of L such that ∩i∈I θi = ∆ with x∆y ⇐⇒ x = y. 4. The set of subdirect decompositions of L and their factor lattices. In the following, we exploit these four notions that, to the best of our knowl- edge, have not been put together yet. 1. The subdirect decomposition ensures that L is a sublattice of the factor lattice product. Moreover, each projection from L to a factor lattice is sur- jective. 2. The congruence relations of L indicate that factor lattices correspond to their quotient lattices, and thus preserve partitions via equivalence classes. 3. The compatible subcontexts give a way to compute the morphism from L onto its factors. 4. Arrow-closed subcontexts enable the computation of the reduced context of the factor lattices. In the following we present the generation of a particular subdirect decom- position and show a possible usage of factor lattices. 3.2 Generation of Subdirectly Irreducible Factors In this section, we consider subdirect decompositions of a lattice L with its re- duced context (O, A, R) as input. From Corollary 2, a subdirect decomposition of a lattice L can be obtained by computing a set of arrow-closed subcontexts of (O, A, R) that have to cover O and A. There are many sets of arrow-closed sub- contexts and thus many subdirect decompositions. 
In particular, the decomposi- tion from a lattice L into L itself is a subdirect decomposition, corresponding to the whole subcontext (O, A, R) which is clearly arrow-closed. A subdirect decom- position algorithm has been proposed in [12]. However, all congruence relations Subdirect Decomposition of Contexts into Subdirectly Irreducible Factors 57 are computed and then only pairs of relations are formed to get a decomposi- tion. As a consequence, potentially multiple decompositions are produced with necessarily two factors. In this article, we focus on the subdirect decomposition of a context into a possibly large number of small factors, i.e. factors that cannot be subdirectly decomposed. A factor lattice L is subdirectly irreducible when any subdirect decomposition of L leads to L itself. A nice characterization of subdirectly irre- ducible lattices can be found in [13]: Proposition 2. A lattice L is subdirectly irreducible if and only if its reduced context is one-generated. A context (O, A, R) is one-generated if it can be obtained by arrow-closing a context with only one j ∈ J. Thus (O, A, R) is the smallest arrow-closed subcontext containing j ∈ J. Therefore, we deduce the following result: Proposition 3. Let L be a lattice. From L, we can deduce a product lattice L1 × ... × Ln where each lattice Li is: – the concept lattice of a one-generated subcontext, – subdirectly irreducible, – a factor lattice of the subdirectly irreducible decomposition. From this result, we propose an algorithm (Algorithm 1) to compute in poly- nomial time the contexts of the factor lattices L1 , . . . , ...Ln of a subdirectly irreducible decomposition, with a reduced context (O, A, R) as input. The one- generated subcontexts for each j ∈ J are obtained by arrow-closing (Algorithm 2). The subdirectly irreducible decomposition of L can then be obtained by computing the concept lattices of these subcontexts. One can notice that the closure computed from join-irreducible concepts can also be calculated from meet-irreducible concepts. Algorithm 1: Subdirect Decomposition Input: A context (O, A, R) Output: List L of the contexts (Jj , Mj , Rj ) of the subdirectly irreducible factor lattices 1 L ← ∅; 2 forall the j ∈ O do 3 Compute (Jj , Mj , Rj ) = Arrow Closure((j, ∅, ∅), (O, A, R)), the one-generated subcontext span by j; 4 if L does not contain any subcontext that covers (Jj , Mj , Rj ) then 5 add (Jj , Mj , Rj ) to L 6 if L contains a subcontext covered by (Jj , Mj , Rj ) then 7 delete it from L 8 return L; 58 Jean-François Viaud, Karell Bertet, Christophe Demko, and Rokia Missaoui Algorithm 2: Arrow Closure ˜ M̃ , R̃) of a context (J, M, R) Input: A subcontext (J, Output: Arrow-closure of (J,˜ M̃ , R̃) ˜ 1 Jc = J; Mc = M̃ ; 2 predJ = 0; predM = 0; 3 while predM < card(Mc ) or predJ < card(Jc ) do 4 predJ = card(Jc ); 5 predM = card(Mc ); 6 forall the j ∈ Jc do 7 add to Mc all m ∈ M such that j ↑ m; 8 forall the m ∈ Mc do 9 add to Jc all j ∈ J such that j ↓ m; 10 Return (Jc , Mc , R ∩ Jc × Mc ) j Arrow Closure Con- tained in L Input Output ˜ M̃ , R̃) (J, Jc Mc 2 (2, ∅, ∅) {2} {j} × 3 (3, ∅, ∅) {3} {c} 5 (5, ∅, ∅) {3, 5, 6} {c, d, f, g} × 6 (6, ∅, ∅) {6} {d} 9 (9, ∅, ∅) {9} {b} × Fig. 5. Iterations of Algorithm 1 for the reduced context in Figure 2 Consider the reduced context in Figure 2. Each iteration of Algorithm 1 is described by Figure 5 for each value of j, the input and output of Algorithm 2, and the three one-generated subcontexts that belong to L at the end of the process. 
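A compact Python sketch of Algorithms 1 and 2 on this running example is given below. The UP/DOWN pairs encode our reading of the arrows of Fig. 3, with the double arrow counted as both an up- and a down-arrow (this reading is what makes the closure of object 5 reach {3, 5, 6}, as in Fig. 5); all identifiers are ours.

# Our reading of Fig. 3; the double arrow is placed in both UP and DOWN.
UP   = {("2", "j"), ("3", "c"), ("5", "f"), ("5", "g"), ("6", "d"), ("9", "b")}
DOWN = UP | {("3", "f"), ("6", "g")}
OBJECTS = ["2", "3", "5", "6", "9"]

def arrow_closure(seed):
    """Algorithm 2: arrow-close the subcontext generated by a single object."""
    J, M = {seed}, set()
    changed = True
    while changed:
        changed = False
        for (j, m) in UP:
            if j in J and m not in M:
                M.add(m); changed = True
        for (j, m) in DOWN:
            if m in M and j not in J:
                J.add(j); changed = True
    return frozenset(J), frozenset(M)

def subdirect_decomposition():
    """Algorithm 1: keep only one-generated subcontexts not covered by another."""
    kept = []
    for g in OBJECTS:
        J, M = arrow_closure(g)
        if any(J <= J2 and M <= M2 for (J2, M2) in kept):
            continue                          # already covered
        kept = [(J2, M2) for (J2, M2) in kept if not (J2 <= J and M2 <= M)]
        kept.append((J, M))
    return kept

for J, M in subdirect_decomposition():
    print(sorted(J), sorted(M))
# prints: ['2'] ['j']  /  ['3', '5', '6'] ['c', 'd', 'f', 'g']  /  ['9'] ['b']

The three surviving subcontexts are exactly those listed in Fig. 5.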
Therefore we get three factor lattices (see Figure 6). The first subcontext is the one on the left of Figure 4. The two other ones are: ({2}, {j}, ∅) and ({9}, {b}, ∅). The latter two subcontexts are interesting because: – They show that the initial lattice has parts which are distributive. Indeed, these two subcontexts contain exactly one double arrow in each line and each column. – They give us a dichotomy: any concept contains either 2 or j ; any concept contains either 9 or b – In the reduced context, an arrow brings a deeper knowledge than a cross. Subdirect Decomposition of Contexts into Subdirectly Irreducible Factors 59 (a) (b) First subcontext in (c) ({2}, {j}, ∅) Figure 4 ({9}, {b}, ∅) Fig. 6. The three factor lattices of the decompostion with their subcontext as caption The context on the left hand-side of Figure 4 is tricky to understand. For the other ones, we have a simple relation 2 l j or 9 l b, which means that, for instance, 2 and j are some kind of complement or converse. Figure 7 shows a factor lattice and its corresponding congruence. Fig. 7. Factor lattice and congruence 3.3 Onto Morphism and FCA A subdirect decomposition of a lattice L into factor lattices L1 , . . . , Ln is relevant since there exists an into morphism from L to the product lattice L1 × . . . × Ln . This morphism is specified by the bijection between compatible subcontexts and congruence relations stated by Corollary 1: 60 Jean-François Viaud, Karell Bertet, Christophe Demko, and Rokia Missaoui Proposition 4. Let (J, M, R ∩ J × M ) be a compatible subcontext, then the relation ΘJ,M defined by: (A1 , B1 )ΘJ,M (A2 , B2 ) ⇐⇒ A1 ∩ J = A2 ∩ J ⇐⇒ B1 ∩ M = B2 ∩ M is a congruence relation, and its factor lattice is isomorphic to the concept lattice of the subcontext (J, M, R ∩ J × M ). Algorithm 3 computes this morphism: each concept of L is computed as the product of concepts in each factor, and then marked in the product lattice. From Algorithm 3, we get the large tagged lattice product shown in Figure 8. Obviously, this algorithm is not intended to be used in a real application with large contexts since the product lattice is much more bigger than the original one, while the main goal of the decomposition is to get smaller lattices. We only use this algorithm in the empirical study. Nevertheless, this morphism can be extended to develop basic FCA pro- cessing. Once the subdirectly irreducible decomposition of a reduced context (O, A, R) into the contexts C1 , . . . , Cn is computed, an interactive exploration and mining process can easily be considered by using the following basic tasks and avoiding the generation of the lattice for the whole context (O, A, R): – Compute the smallest concept of L that contains a given subset of objects or attributes, and identify its neighborhood – Compute the smallest concept cij and its neighborhood in a subset of factors that contain a given collection of objects or attributes. Each factor Li is a specific view of data. Algorithm 3: Into morphism Input: Initial lattice L; Subcontexts (Jj , Mj , Rj ); Product lattice P = L1 × . . . × Ln Output: Product lattice P with nodes coming from L marked. 1 forall the c = (A, B) ∈ L do 2 forall the (Jj , Mj , Rj ) do 3 Compute (A ∩ Jj , B ∩ Mj ); 4 Mark, in P , the product node Πj (A ∩ Jj , B ∩ Mj ); 4 Experiments In this section, we conduct an empirical study in order to better understand the impact of the subdirect decomposition on contexts with different density levels. 
Fig. 8. Ugly tagged lattice product
All the tests were done using the java-lattices library, available at http://thegalactic.github.io/java-lattices/. A thousand contexts with 10 observations and 5 attributes were randomly generated. The experiments were run three times using three density values, namely 20%, 50% and 80%. Figure 9 presents the number of generated lattices according to their size and their density. We observe two histograms with a nice Gaussian shape in the first two cases, but a strange behavior in the last case. However, we roughly observe that the higher the density, the bigger the lattices. Therefore, the context density will have an impact on the decomposition process.
Fig. 9. Number of lattices per size ((a) density of 20%, (b) density of 50%, (c) density of 80%).
The number of generated factors is given in Figure 10. One can observe that this number increases with the density of the initial context since the corresponding lattices have more edges. With a context density of 20%, we get 87.40% irreducible contexts. However, with a density of 80%, only 1.50% of the contexts are irreducible and 70% of the contexts have at least 4 factors in their decomposition. Thus, lattices are more decomposable when they come from dense contexts.

Factors  Density=20%  Density=50%  Density=80%
1        87.40%       70.50%        1.50%
2         9.70%       21.40%        9.90%
3         2.60%        6.20%       18.70%
4         0.30%        0.50%       23.40%
5            -          1.40%      28.50%
6            -             -       18.00%
Fig. 10. Proportion of the number of factors in the subdirect decomposition of the contexts according to their density

Let us now examine the size of the factors for two density values, namely 20% and 50%. Figure 11 gives the number of cases (initial lattices) according to the number of produced factors. As expected, smaller lattices give rise to smaller factors. We can also see that in both density cases the largest number is obtained for factors of size 2.
Fig. 11. Number of cases according to the number of generated factors ((a) density 20%, (b) density 50%).
The last part of the tests uses contexts with a large density of 80% and computes the ratio between the number of nodes (resp. edges) of the initial lattice and the number of nodes (resp. edges) of the product lattice. We thus get Figure 12, which shows that the higher the number of factors, the bigger the product lattice. Consequently, the set of useless (void) nodes in the product becomes larger as the number of factors increases. We have also conducted experiments with 40 observations and 15 attributes, for which a hundred contexts were generated using two density values, 20% and 50%. Unfortunately, all were subdirectly irreducible.

Factors  % of nodes  % of edges
1        100%        100%
2        97.90%      97.20%
3        89.20%      84.60%
4        85.00%      77.40%
5        80.60%      71.80%
6        68.00%      57.30%
Fig. 12. Proportion of untagged nodes and edges

5 Conclusion and future work
In this paper, we have presented a polynomial algorithm for the decomposition of a reduced context into subcontexts such that the concept lattice of each subcontext is a subdirectly irreducible factor. This decomposition is a direct consequence of the strong links, established in [13], between the factors of a subdirect decomposition, congruence relations, arrow-closed subcontexts and compatible subcontexts.
To further investigate the subdirect decomposition, it would be interesting to conduct large-scale experiments on real world data not only to confirm/nullify the preliminary empirical tests but also to understand the semantics behind the generated irreducible contexts. In particular, attributes covered by several factors interfere in different views of data whose semantics must be interesting to understand. Moreover, it would be useful to allow the user to interactively select a few factors of the decomposition by mixing our approach with the one in [12]. From a theoretical point of view, we think that there are strong links between the implication basis in a quotient lattice and the one from the initial lattice. To the best of our knowledge, this issue has never been addressed and could have significant algorithmic impacts. However, we note that [19] tackle a similar issue in case of a vertical decomposition of a context into subcontexts. Since the empirical study in [18] show that many real-life contexts are subdi- rectly irreducible, we plan to (i) identify cases in which a context is necessarily irreducible, and (ii) study, compare and combine other decompositions, in par- ticular the Fratini congruence [7], and the reverse doubling convex construction [5, 17, 14, 3]. Finally, the construction of a lattice from its factor lattices can be done based on the optimization principles behind the relational join operation in databases. References 1. Barbut, M., Monjardet, B. (eds.): L’ordre et la classification. Algèbre et combina- toire, tome II, Hachette (1970) 2. Belohlavek, R., Vychodil, V.: Discovery of optimal factors in binary data via a novel method of matrix decomposition. Journal of Computer and System Sciences 76(1), 3–20 (Feb 2010) 64 Jean-François Viaud, Karell Bertet, Christophe Demko, and Rokia Missaoui 3. Bertet, K., Caspard, N.: Doubling convec sets in lattices: characterizations and recognition algorithms. Tech. Rep. TR-LACL-2002-08, LACL (Laboratory of Al- gorithms, Complexity and Logic), University of Paris-Est (Paris 12) (2002) 4. Crawley, P., Dilworth, R.: Algebraic theory of lattices. Prentice Hall, Englewood Cliffs (1973) 5. Day, A.: Congruence normality: The characterization of the doubling class of con- vex sets. algebra universalis 31(3), 397–406 (1994) 6. Demel, J.: Fast algorithms for finding a subdirect decomposition and interesting congruences of finite algebras. Kybernetika (Prague) 18(2), 121–130 (1982) 7. Duquenne, V.: Lattice drawings and morphisms. In: Formal Concept Analysis, 8th International Conference, ICFCA 2010, Agadir, Morocco, March 15-18, 2010. Proceedings. pp. 88–103 (2010) 8. Ferré, S.: Reconciling Expressivity and Usability in Information Access from File Systems to the Semantic Web. Ph.D. thesis, Univeristy Rennes 1 (2014) 9. Freese, R.: Computing congruence lattices of finite lattices. Proceedings of the American Mathematical Society 125(12), 3457–3463 (1997) 10. Freese, R.: Algorithms in finite, finitely presented and free lattices. Preprint, July 22, 159–178 (1999) 11. Freese, R.: Computing congruences efficiently. Algebra universalis 59(3-4), 337–343 (2008) 12. Funk, P., Lewien, A., Snelting, G.: Algorithms for concept lattice decomposition and their applications. Tech. rep., TU Braunschweig (December 1995) 13. Ganter, B., Wille, R.: Formal concept analysis - mathematical foundations. Springer (1999) 14. Geyer, W.: The generalized doubling construction and formal concept analysis. algebra universalis 32(3), 341–367 (1994) 15. 
Grätzer, G.: General lattice theory. Birkhäuser-Verlag, Basel (1978) 16. Mihók, P., Semanis̃in, G.: Unique factorization theorem and formal concept anal- ysis. In: Yahia, S., Nguifo, E., Belohlavek, R. (eds.) Concept Lattices and Their Applications, Lecture Notes in Computer Science, vol. 4923, pp. 232–239. Springer Berlin Heidelberg (2008) 17. Nation, J.: Alan day’s doubling construction. algebra universalis 34(1), 24–34 (1995) 18. Snelting, G.: Concept lattices in software analysis. In: Formal Concept Analysis, Foundations and Applications. pp. 272–287 (2005) 19. Valtchev, P., Duquenne, V.: On the merge of factor canonical bases. In: Inter- national Conference on Formal Concept Analysis ICFCA, pp. 182–198. Springer Berlin Heidelberg (2008) 20. Visani, M., Bertet, K., Ogier, J.M.: Navigala: an Original Symbol Classifier Based on Navigation through a Galois Lattice. International Journal on Pattern Recog- nition and Artificial Intelligence (IJPRAI) (2011) 21. Wille, R.: Subdirekte produkte und konjunkte summen. Journal für die reine und angewandte Mathematik 0239 0240, 333–338 (1969) 22. Wille, R.: Subdirekte Produkte vollständiger Verbände. J. reine angew. Math. 283/284, 53–70 (1976) 23. Wille, R.: Subdirect decomposition of concept lattices. Algebra Universalis 17, 275–287 (1983) 24. Wille, R.: Subdirect product construction of concept lattices. Discrete Mathematics 63(2-3), 305–313 (1987) Context-Dependent Deductive and Inductive Reasoning Based on Good Diagnostic Tests Xenia Naidenova1 and Vladimir Parkhomenko2 1 Military Medical Academy, Saint-Petersburg, Russia ksennaidd@gmail.com 2 Peter the Great St. Petersburg Polytechnic University, St. Petersburg, Russia parhomenko.v@gmail.com Abstract. A sketch of classification reasoning is given in the paper. The key ideas of the reasoning are ideas of classification and its good approximations based on good diagnostic tests. Such good tests, which are maximally redundant (GMRTs), i.e. their subsets of attributes are closed, are considered. Classification reasoning embraces two interrelated processes: inductive inferring implicative assertions based on GMRTs and using these assertions in deductive steps of reasoning. A peculiarity of our inductive model of classification reasoning is the control and trans- formation of subcontexts during inductive phase of reasoning by the use of deductive reasoning rules, for example the rules prohibiting some pairs of attributive values. Keywords: good diagnostic test, classification reasoning, essential at- tributes, essential objects, implications, subcontexts, deduction, induc- tion, closed sets 1 Introduction Classification reasoning is a part of commonsense (plausible) reasoning based on two kinds of logical rules extracted from observable datasets (functional and implicative dependencies). The principle concepts of this reasoning are the con- cepts of classification and its good approximations. We assume that objects are described by a set U of symbolic or numeric attributes. To give a target clas- sification of objects, we use an additional attribute KL not belonging to U . A target attribute partitions a given set of objects into disjoint classes the number of which is equal to the number of values of this attribute. In Tab. 1, we have two classes: KL+ (positive objects) and KL− (negative objects). We concentrate on the case of object classification into two classes and task of concept formation [2,5]. 
The goal of this task is to describe/classify new objects according to description/classification of existing objects. Inferring good diag- nostic tests (GDTs) is the formation of the best descriptions of a given object class KL+ against the objects not belonging to this class (KL− ). Let M = {∪dom(attr), attr ∈ U }, where dom(attr) is the set of all values of attr. Let X ⊆ M and G be the set of indices of objects considered (objects c 2015 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Local Proceedings also appeared in ISBN 978-84-606-7410-8, Universidad de Málaga (Dept. Matemática Aplicada), Spain. 66 Xenia Naidenova and Vladimir Parkhomenko Table 1. Motivating Example of classification No Height Color of Hair Color of Eyes KL 1 Low Blond Blue KL+ 2 Low Brown Blue KL− 3 Tall Brown Hazel KL− 4 Tall Blond Hazel KL− 5 Tall Brown Blue KL− 6 Low Blond Hazel KL− 7 Tall Red Blue KL+ 8 Tall Blond Blue KL+ for short), G = G+ ∪ G− , where G+ and G− the sets of positive and negative objects, respectively. Denote by d(g) the description of object g ∈ G. Let P(X) = {g | g ∈ G, X ⊆ d(g)}. We call P(X) the interpretation of X in the power set 2G . If P(X) contains only positive objects and the number of these objects more than 2, then we call X a description of some positive objects and (P(X), X) a test for positive objects. Let us define a good test or good description of objects. Definition 1. A set X ⊆ M of attribute values is a good description of positive (negative) objects if and only if it is the description of these objects and no such subset Y ⊆ M exists, that P(X) ⊂ P(Y ) ⊆ G+ . It can be seen [15,11,1] that this problem is reduced to searching for causal dependencies in the form X → v, X ⊆ M, v is value of attribute KL for all positive (negative) objects. Sec.2 describes an approach to classification reasoning based on implications. Sec.3 describes the definitions of Good Test Analysis. Sec.4 presents the de- composition of inferring good tests into two kinds of subtasks. An approach to selecting and ordering subcontexts is given in Subsec.5.1. Some situations of re- ducing the classification context are considered in Subsec.5.2. A set of examples illustrates all the considerations. 2 Classification Reasoning Classification reasoning based on GDTs embraces two interrelated processes: inductive inferring implicative assertions and using them for pattern recognition problems. We need the following three types of logical rules in order to realize classification reasoning (deductive and inductive): – INSTANCES or relationships between objects or facts really observed. In- stance can be considered as a logical rule with the least degree of general- ization. On the other hand, instances can serve as a source of training set of positive and negative examples for inductive inference of generalized rules. Context-Dependent Reasoning Based on Good Diagnostic Tests 67 – RULES OF THE FIRST TYPE or logical assertions. These rules describe regular relationships between objects and their properties, between proper- ties of different objects, between objects and their classes. The rules of the first type can be given explicitly by an expert or derived automatically from instances with the help of some learning process. These rules are represented, in the paper, in the form of implications. 
– RULES OF THE SECOND TYPE or commonsense reasoning rules with the help of which rules of the first type are used, updated, and inferred from data (instances). The rules of the second type embrace both inductive and deductive reasoning rules. The rules of the first type can be represented with the use of only one class of logical statements based on implicative dependencies between names. Names are used for designating concepts, things, events, situations, or any evidences. They can be considered as attributes and attributes’ values in the formal repre- sentations of logical rules. 1. Implication: a, b, c → d. This rule means that if the values standing on the left side of rule are simultaneously true, then the value on the right side of rule is always true. 2. Interdiction or forbidden rule (a special case of implication): a, b, c → f alse (never). This rule interdicts a combination of values enumerated on the left side of rule. The rule of interdiction can be transformed into several implications such as a, b →not c; a, c →not b; b, c →not a. 3. Compatibility: a, b, c → VA where VA is the frequency of occurrence of rule. The compatibility is equivalent to the following implications: a, b → c, VA ; a, c → b, VA; b, c → b, VA. Generally, the compatibility rule represents the most common combination of values, which is characterized by an insignificant number of exceptions (contrary examples) from regularity. 4. Diagnostic rule: x, d → a; x, b →not a; d, b → f alse. For example, d and b can be two values of the same attribute. This rule works when the truth of ’x’ has been proven and it is necessary to determine whether ’a’ is true or not. If ’x&d’ is true, then ’a’ is true, if ’x&b’ is true, then ’a’ is f alse. 5. Rule of alternatives: a or b → true (always); a, b → f alse. This rule says that a and b cannot be simultaneously true, either a or b can be true but not both. This rule is a variant of interdiction. Deductive steps of classification reasoning consist of inferring consequences from some observed facts with the use of implications (i.e. knowledge). For this goal, deductive rules of reasoning are applied the main forms of which are modus ponens, modus tollens, modus ponendo tollens, and modus tollendo ponens. Let x be a set of true values of some attributes observed simultaneously. Deductive reasoning rules of the second type are: 1. Using implication: Let r be an implication, left(r) be the left part of r and right(r) be the right part of r. If left(r) ⊆ x, then x can be extended by 68 Xenia Naidenova and Vladimir Parkhomenko right(r) : x ← x ∪ right(r). Using implication is based on modus ponens: if A, then B; A; hence B. 2. Using interdiction: Let r be an implication y →not k. If left(r) ⊆ x, then k is the forbidden value for all extensions of x. Using interdiction is based on modus ponendo tollens: either A or B (A, B — alternatives); A; hence not B; either A or B; B; hence not A. 3. Using compatibility: Let r = ’a, b, c → k, VA’. If left(r) ⊆ x, then k can be used to extend x along with the calculated value VA for this extension (a special consideration is required for calculating VA). 4. Using diagnostic rules: Let r be a diagnostic rule such as ’x, d → a; x, b →not a’, where ’x’ is true, and ’a’, ’not a’ are hypotheses or possible values of some attribute. Using the diagnostic rule is based on modus ponens and modus ponendo tollens. 
There are several ways for refuting one of the hypotheses: – To infer either d or b by using existing knowledge; – To involve new known facts (extended context) and/or propositions for inferring (with the use of inductive reasoning rules of the second type) new implications for distinguishing between the hypotheses ’a’ and ’not a’; to apply the newly obtained rules; – To get the direct answer on whether d or b is true by involving an external source of knowledge. 5. Using rule of alternatives: Let ’a’ and ’b’ be two alternatives (mutually ex- clusive) hypotheses about the value of some attribute, and the truth of one of hypotheses has been established, then the second hypothesis is rejected. Using the rule is based on modus tollendo ponens: either A or B (A, B — alternatives); not A; hence B; either A or B; not B; hence A. We can call the rules listed above the rules of “forward inference”. However, we have in natural classification reasoning so called “backward inference”. 6. Generating hypothesis or abduction rule. Let r be an implication y → k. We have “if k is true, then y may be true”. 7. Using modus tollens. Let r be an implication y → k. If ’not k’ is inferred, then ’not y’ is also inferred. When applied, the above given rules generate the reasoning, which is not demonstrative. The purpose of it is to infer all possible hypotheses of the value of some objective attribute. It is essential that hypotheses do not contradict with knowledge (first-type rules) and the observable real situation, where the reason- ing takes place (contextual knowledge). Inference of hypotheses is reduced to constructing all intrinsically consistent extensions of a set of values in which the number of involved attributes is maximum possible (there are no prohibited pairs of values in such extensions). All hypotheses have different admissibility determined by the quantity and “quality” of compatibility rules involved in rea- soning. A very simple structure of knowledge base (KB) and an example of applying the rules of second type for a pattern recognition problem can be found in [19]. The process of deductive reasoning can require inferring new rules of the first type from data when i) the result of reasoning contains several hypotheses and Context-Dependent Reasoning Based on Good Diagnostic Tests 69 it is impossible to choose one and only one of them (uncertainty), and ii) it is impossible to infer any hypothesis. Inductive steps of classification reasoning consist in using already known facts, observations and experimental results for inferring implications and their correcting. The main forms of induction have been formulated by J. Stuart Mill [12]. The formalization of these forms of induction has been offered by V.K. Finn [6]. For inferring GDTs, the Combined Similarity and Difference Method of the Mill are used and realized by means of inductive inference rules of the second type proposed in [13]. Deductive rules of the second type described above also work for inductive inferring implications since implications obtained are involved immediately to reduce the search space. In the paper, we shall illustrate using the forbidden and diagnostic rules of the second type for GDTs inferring. 3 Key Notions of Good Test Analysis Assume that G = 1, N be the set of objects and M = {m1 , m2 , . . . , mj , . . . mm } be the set of values. The object descriptions are represented by rows of a table, columns of which are associated with the attributes taking their values in M (see, please, Tab.1). 
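For concreteness, the toy classification of Tab. 1 can be encoded directly as sets of attribute values (the values are pairwise distinct across attributes, so attribute names can be dropped). The Python sketch below is ours: the dictionaries and helper names are our own choices, and P(X) is the interpretation introduced in Sect. 1.

# The classification of Tab. 1 as value sets; helper names are ours.
POS = {  # KL+
    1: {"Low", "Blond", "Blue"},
    7: {"Tall", "Red", "Blue"},
    8: {"Tall", "Blond", "Blue"},
}
NEG = {  # KL-
    2: {"Low", "Brown", "Blue"},
    3: {"Tall", "Brown", "Hazel"},
    4: {"Tall", "Blond", "Hazel"},
    5: {"Tall", "Brown", "Blue"},
    6: {"Low", "Blond", "Hazel"},
}
ALL = {**POS, **NEG}

def interpretation(X):
    """P(X): all objects whose description contains X."""
    return {g for g, d in ALL.items() if X <= d}

X = {"Blond", "Blue"}
print(interpretation(X))                 # {1, 8}
print(interpretation(X) <= set(POS))     # True: X describes only KL+ objects

In particular, {Blond, Blue} holds exactly for the positive objects 1 and 8, which corresponds to the test ((1, 8), Blond Blue) discussed below.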
Denote by {Bg | Bg ⊆ M, g ∈ G} the description of an object. The Galois connection [21] between the ordered sets (2G , ⊆) and (2M , ⊆), i.e. 2G → 2M and 2M → 2G , is defined by the following mappings called derivation operators (see although different notations in [16,9]): for A ⊆ G and B ⊆ M , A0 = val(A) = {intersection of all Bg | Bg ⊆ M, g ∈ A} and B 0 = obj(B) = {g| g ∈ G, B ⊆ Bg }. We consider obj(B) = {intersection of all obj(m)| obj(m) ⊆ G, m ∈ B}. In the Formal Concept Analysis, for g ∈ G and m ∈ M , {g}0 or simply g 0 is called object intent, and {m}0 or simply m0 is called attribute extent [7]. There are two closure operators [16]: generalization of(B) = B 00 = val(obj(B)) and generalization of(A) = A00 = obj(val(A)). A set A is closed if A = obj(val(A)) and a set B is closed if B = val(obj(B)). If A0 = B & B 0 = A, then a pair (A, B) is called a formal concept, subsets A and B of which are called concept extent and intent, respectively. A triple (G, M, I), where I is a binary relation G × M is called a formal context K. Let K := (G , M, I ) and I := I ∩ (G × M ), where ∈ {+, −} (one can add the value τ if there is an undefined object) [8]. K± := (G+ ∪ G− , M ∪ {KL}, I+ ∪ I− ∪ (G± × {KL})). Definition 2. A Diagnostic (classification) Test (DT) for G+ is a pair (A, B) such that B ⊆ M , A = obj(B) 6= ∅, A ⊆ G+ , and B 6⊆ val(g), ∀g ∈ G− . Equivalently, obj(B) ∩ G− = ∅. Definition 3. A DT (A, B) for G+ is maximally redundant if obj(B ∪ m) ⊂ A for all m ∈ / B and m ∈ M . Definition 4. A DT (A, B) for G+ is irredundant if any narrowing B∗ = B \ m, m ∈ B implies that (obj(B∗ ), B∗ )) is not a test for G+ . 70 Xenia Naidenova and Vladimir Parkhomenko Definition 5. A DT (A, B) for G+ is good if and only if any extension A∗ = / A, i ∈ G+ implies that (A∗ , val(A∗ )) is not a test for G+ . A ∪ i, i ∈ If a good test (A, B) for G+ is maximally redundant (GMRT), then any extension B∗ = B ∪ m, m ∈ / B, m ∈ M implies that (obj(B∗ ), B∗ ) is not a good test for G+ . The relation GMRTs to the Formal Concept Analysis is considered in [17]. Any object description d(g), g ∈ G in a given classification context is a maxi- mally redundant set of values because ∀m ∈ / d(g), m ∈ M, obj(d(g) ∪ m) is equal to ∅. In Tab.1, ((1, 8), Blond Blue) is a GMRT for KL+ , ((4, 6), Blond Hazel) is a maximally redundant test for KL− but not a good one, and ((3, 4, 6), Hazel) is a good maximally redundant and, simultaneously, good irredundant test for KL− . 4 Decomposition of Inferring GMRTs into Subtasks There are two kinds of subtasks for inferring GMRTs: 1. given a set of values B, where B ⊆ M, obj(B) 6= ∅, B 6⊆ d(g), ∀g ∈ G− , find all GMRTs (obj(B∗ ), B∗ ) such that B∗ ⊂ B; 2. given a non-empty set of values X ⊆ M such that (obj(X), X) is not a test for positive objects, find all GMRTs (obj(Y ), Y ) such that X ⊂ Y . For solving these subtasks we need to form subcontexts of a given classifica- tion context K± . In the first case, we have to include in subcontext all positive objects, the descriptions of which intersect B. The subtask is useful to find all GMRTs intents of which are contained in the description d(g) of an object g. This subtask is considered in [5] for fast incremental concept formation. The fol- lowing notions of object and value projections are developed to form subcontexts [18]. Definition 6. The projection proj(d(g)), g ∈ G+ on the set D+ , where D+ is the set of descriptions of all positive objects, is denoted by {z| z = d(g) ∩ d(g∗ ) 6= ∅, d(g∗ ) ∈ D+ & (obj(z), z) is a test for G+ }. 
Note that d(g) ∈ proj(d(g)). Definition 7. The value projection on a given set D+ proj(B) is {d(g) | B ⊆ d(g), d(g) ∈ D+ } or, equivalently, proj(B) = {d(g) | g ∈ (obj(B) ∩ G+ )}. The projections define the methods to construct two kinds of subcontexts of K± . The following theorem gives the foundation of reducing subcontexts [14]. Theorem 1. Let X ⊆ M, (obj(X), X) be a maximally redundant test for pos- itive objects and obj(m) ⊆ obj(X), m ∈ M . Then m can not belong to any GMRT for positive objects different from (obj(X), X). Context-Dependent Reasoning Based on Good Diagnostic Tests 71 Let splus(m) be obj(m)∩G+ (obj(m)∩G− ) and SPLUS={splus(m), m ∈ M }. Consider an example of reducing subcontext in Tab. 1. For values Hazel, Brown, Tall, Blue, Blond, and Low, SPLUS={obj(m)∩G− } = {{3, 4, 6}, {2, 3, 5}, {3, 4, 5}, {2, 5}, {4, 6}, {2, 6}}. We have val(obj(Hazel)) = Hazel, hence ((3,4,6), Hazel) is a test for G− . Then value Blond can be deleted from consideration, because splus(Blond) ⊂ splus(Hazel). It is essential that the projection is a subset of object descriptions defined on a certain restricted subset t∗ ⊆ M of values. Let s∗ be the subset of objects, descriptions of which produce the projection. In the projection, splus(m) = obj(m) ∩ s∗ , m ∈ t∗ . Let STGOOD be the partially ordered set of elements s such that (s, val(s)) is a good test for D+ . Forming the Set STGOOD. The important part of our basic recursive algorithm is how to form the set STGOOD. Let L = (2S , ≤) be the set lattice [22]. The ordering determined in the set lattice coincides with the set-theoretical inclusion. Subset s1 is absorbed by subset s2 , i.e. s1 ≤ s2 , if and only if s1 ⊆ s2 . Under formation of STGOOD, subset s ⊆ G+ is stored in STGOOD if and only if it is not absorbed by any element of this set. It is necessary also to delete from STGOOD all the elements that are absorbed by s if s is stored in STGOOD. Thus, when the algorithm is over, the set STGOOD contains all the collections of objects that are the extents of GMRTs and only such collections. The set TGOOD of all the GMRTs is obtained as follows: TGOOD = {tg| tg = (s, val(s)), s ∈ STGOOD}. The basic recursive procedure for solving any kind of subtasks has been described in [20]. 5 Selecting and Ordering Subcontexts and Inferring GMRTs Algorithms of GMRTs inferring are constructed by the rules of selecting and ordering subcontexts of the main classification context. Before entering into the details, we give some new definitions. Definition 8. Let t be a set of values such that (obj(t), t) is a test for G+ . The value m ∈ M, m ∈ t is essential in t if (obj(t \ m), (t \ m)) is not a test for a given set of objects. Generally, we are interested in finding the maximal subset sbmax(t) ⊂ t such that (obj(t), t) is a test but (obj(sbmax(t)), sbmax(t)) is not a test for a given set of positive objects. Then sbmin(t) = t \ sbmax(t) is a minimal set of essential values in t. Definition 9. Let s ⊆ G+ , assume also that (s, val(s)) is not a test. The object g, g ∈ s is said to be an essential in s if (s \ g, val(s \ g)) proves to be a test for a given set of positive objects. 72 Xenia Naidenova and Vladimir Parkhomenko Generally, we are interested in finding the maximal subset sbmax(s) ⊂ s such that (s, val(s)) is not a test but (sbmax(s), val(sbmax(s))) is a test for a given set of positive objects. Then sbmin(s) = s \ sbmax(s) is a minimal set of essential objects in s. 
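Definition 8 can be read directly in code: given a test t, the essential values are exactly those whose removal destroys the test property. The following self-contained Python sketch reuses the Tab. 1 encoding; obj and the test predicate follow Definition 2, and all names are ours.

# Tab. 1 again, with obj() and the test predicate of Definition 2 (names are ours).
POS = {1: {"Low", "Blond", "Blue"}, 7: {"Tall", "Red", "Blue"},
       8: {"Tall", "Blond", "Blue"}}
NEG = {2: {"Low", "Brown", "Blue"}, 3: {"Tall", "Brown", "Hazel"},
       4: {"Tall", "Blond", "Hazel"}, 5: {"Tall", "Brown", "Blue"},
       6: {"Low", "Blond", "Hazel"}}
ALL = {**POS, **NEG}

def obj(B):
    """Objects whose description contains every value of B."""
    return {g for g, d in ALL.items() if B <= d}

def is_test(B):
    """(obj(B), B) is a diagnostic test for the positive class (Def. 2)."""
    A = obj(B)
    return bool(A) and A <= set(POS)

def essential_values(B):
    """Def. 8: values whose removal from B destroys the test property."""
    return {m for m in B if not is_test(B - {m})}

t = {"Blond", "Blue"}
print(is_test(t))            # True: obj(t) = {1, 8} contains no negative object
print(essential_values(t))   # {'Blond', 'Blue'}: both values are essential in t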
Finding quasi-maximal (minimal) subsets of objects and values is the key procedure behind searching for initial content of STGOOD and determining the number of subtasks to be solved. 5.1 Using the Diagnostic Rule of the Second Type An Approach for Searching for Initial Content of STGOOD. In the beginning of inferring GMRTs, the set STGOOD is empty. Next we describe the procedure to obtain an initial content of it. This procedure extracts a quasi- maximal subset s∗ ⊆ G+ , which is the extent of a test for G+ (maybe not good). We begin with the first object g1 of s∗ , then we take the next object g2 of s∗ and evaluate the function to be test({g1 , g2 }, val({g1 , g2 })). If the value of the function is true, then we take the next object g3 of s∗ and evaluate the function to be test({g1 , g2 , g3 }, val({g1 , g2 , g3 })). If the value of the function to be test({g1 , g2 }, val({g1 , g2 })) is f alse, then the object g2 of s∗ is skipped and the function to be test({g1 , g3 }, val({g1 , g3 }))) is evaluated. We continue this process until we achieve the last object of s∗ . This procedure can be considered as a realization of the diagnostic rule of the second type. To illustrate this procedure, we use the sets D+ and D− represented in Tab.2 and 3 (our illustrative example). In these tables M = {m1 , . . . , m26 }. The set SPLUS0 for positive class of objects is in Tab.4. The initial content of STGOOD0 is {(2,10), (3, 10), (3, 8), (4, 12), (1, 4, 7), (1, 5,12), (2, 7, 8), (3, 7, 12), (1, 2, 12, 14), (2, 3, 4, 7), (4, 6, 8, 11)}. In what follows, we denote subsets of values {m8 , m9 }, {m14 , m15 } by ma and mb , respectively. Applying operation generalization of(s) = s00 = obj(val(s)) to ∀s ∈ STGOOD, we obtain STGOOD1 = {(2,10), (3, 10), (3, 8), (4, 7, 12), (1, 4, 7), (1, 5,12), (2, 7, 8), (3, 7, 12), (1, 2, 12, 14), (2, 3, 4, 7), (4, 6, 8, 11)}. By Th.1, we can delete value m12 from consideration, see splus(m12 ) in Tab.4. The initial content of STGOOD allows to decrease the number of using the procedure to be test() and the number of putting extents of tests into STGOOD. The number of subtasks to be solved. This number is determined by the number of essential values in the set M . The quasi-minimal subset of essential values in M can be found by a procedure analogous to the procedure applicable to search for the initial content of STGOOD. As a result of this procedure, we have quasi-maximal subset sbmax(M ) of M , such that (obj(sbmax(M )), sbmax(M )) is not a test for positive examples. Then subset LEV = M \sbmax(M ) is quasi-minimal subset of essential values in M . We have the following LEV : {m16 , m18 , m19 , m20 , m21 , m22 , m23 , m24 , m26 }. Note, that searching for the number of subtasks to be solved is also based on using the diagnostic rule of the second type. Context-Dependent Reasoning Based on Good Diagnostic Tests 73 Table 2. The set D+ of positive object descriptions G D+ 1 m1 m2 m5 m6 m21 m23 m24 m26 2 m4 m7 m8 m9 m12 m14 m15 m22 m23 m24 m26 3 m3 m4 m7 m12 m13 m14 m15 m18 m19 m24 m26 4 m1 m4 m5 m6 m7 m12 m14 m15 m16 m20 m21 m24 m26 5 m2 m6 m23 m24 6 m7 m20 m21 m26 7 m3 m4 m5 m6 m12 m14 m15 m20 m22 m24 m26 8 m3 m6 m7 m8 m9 m13 m14 m15 m19 m20 m21 m22 9 m16 m18 m19 m20 m21 m22 m26 10 m2 m3 m4 m5 m6 m8 m9 m13 m18 m20 m21 m26 11 m1 m2 m3 m7 m19 m20 m21 m22 m26 12 m2 m3 m16 m20 m21 m23 m24 m26 13 m1 m4 m18 m19 m23 m26 14 m23 m24 m26 Table 3. 
The set D− of negative object descriptions G D− G D− 15 m3 m8 m16 m23 m24 32 m1 m2 m3 m7 m9 m13 m18 16 m7 m8 m9 m16 m18 33 m1 m5 m6 m8 m9 m19 m20 m22 17 m1 m21 m22 m24 m26 34 m2 m8 m9 m18 m20 m21 m22 m23 m26 18 m1 m7 m8 m9 m13 m16 35 m1 m2 m4 m5 m6 m7 m9 m13 m16 19 m2 m6 m7 m9 m21 m23 36 m1 m2 m6 m7 m8 m13 m16 m18 20 m19 m20 m21 m22 m24 37 m1 m2 m3 m4 m5 m6 m7 m12 m14 m15 m16 21 m1 m20 m21 m22 m23 m24 38 m1 m2 m3 m4 m5 m6 m9 m12 m13 m16 22 m1 m3 m6 m7 m9 m16 39 m1 m2 m3 m4 m5 m6 m14 m15 m19 m20 m23 m26 23 m2 m6 m8 m9 m14 m15 m16 40 m2 m3 m4 m5 m6 m7 m12 m13 m14 m15 m16 24 m1 m4 m5 m6 m7 m8 m16 41 m2 m3 m4 m5 m6 m7 m9 m12 m13 m14 m15 m19 25 m7 m13 m19 m20 m22 m26 42 m1 m2 m3 m4 m5 m6 m12 m16 m18 m19 m20 m21 m26 26 m1 m2 m3 m5 m6 m7 m16 43 m4 m5 m6 m7 m8 m9 m12 m13 m14 m15 m16 27 m1 m2 m3 m5 m6 m13 m18 44 m3 m4 m5 m6 m8 m9 m12 m13 m14 m15 m18 m19 28 m1 m3 m7 m13 m19 m21 45 m1 m2 m3 m4 m5 m6 m7 m8 m9 m12 m13 m14 m15 29 m1 m4 m5 m6 m7 m8 m13 m16 46 m1 m3 m4 m5 m6 m7 m12 m13 m14 m15 m16 m23 m24 30 m1 m2 m3 m6 m12 m14 m15 m16 47 m1 m2 m3 m4 m5 m6 m8 m9 m12 m14 m16 m18 m22 31 m1 m2 m5 m6 m14 m15 m16 m26 48 m2 m8 m9 m12 m14 m15 m16 74 Xenia Naidenova and Vladimir Parkhomenko Table 4. The set SPLUS0 splus(m), m ∈ M splus(m), m ∈ M splus(ma ) → {2, 8, 10} splus(m22 ) → {2, 7, 8, 9, 11} splus(m13 ) → {3, 8, 10} splus(m23 ) → {1, 2, 5, 12, 13, 14} splus(m16 ) → {4, 9, 12} splus(m3 ) → {3, 7, 8, 10, 11, 12} splus(m1 ) → {1, 4, 11, 13} splus(m4 ) → {2, 3, 4, 7, 10, 13} splus(m5 ) → {1, 4, 7, 10} splus(m6 ) → {1, 4, 5, 7, 8, 10} splus(m12 ) → {2, 3, 4, 7} splus(m7 ) → {2, 3, 4, 6, 8, 11} splus(m18 ) → {3, 9, 10, 13} splus(m24 ) → {1, 2, 3, 4, 5, 7, 12, 14} splus(m2 ) → {1, 5, 10, 11, 12} splus(m20 ) → {4, 6, 7, 8, 9, 10, 11, 12} splus(mb ) → {2, 3, 4, 7, 8} splus(m21 ) → {1, 4, 6, 8, 9, 10, 11, 12} splus(m19 ) → {3, 8, 9, 11, 13} splus(m26 ) → {1, 2, 3, 4, 6, 7, 9, 10, 11, 12, 13, 14} Proposition 1. Each essential value is included at least in one positive object description. Proof. Assume that for an object description tg , g ∈ G+ , we have tg ∩ LEV = ∅. Then tg ⊆ M \ LEV. But M \ LEV is included at least in one of the negative object descriptions and, consequently, tg also possesses this property. But it contradicts to the fact that tg is the description of a positive object. t u Proposition 2. Assume that X ⊆ M . If X ∩ LEV = ∅, then to be test(obj(X), X) = f alse. Proposition 3. Let K± be a given classification context. For finding all GRMTs containing in K± , it is sufficient to solve this problem only for subcontexts asso- ciated with essential values in M (the set LEV). Proposition 2 is the consequence of Proposition 1. The poof of Proposition 3 is obvious. Note that the description of t14 = {m23 , m24 , m26 } is closed because of obj{m23 , m24 , m26 } = {1, 2, 12, 14} and val{1, 2, 12, 14} = {m23 , m24 , m26 }. We also know that s = {1, 2, 12, 14} is closed too (we obtained this result during generalization of elements of STGOOD). So (obj({m23 , m24 , m26 }), {m23 , m24 , m26 }) is a maximally redundant test for positive objects and we can, conse- quently, delete t14 from consideration. As a result of deleting m12 and t14 , we have the modified set SPLUS (Tab.5). 5.2 Using the Rule of Alternative for Splitting Subcontexts Essentially, searching for GMRTs turns out a process of generating deductive reasoning rules of the first type, which are used immediately during this process. Now we shall give an example of constructing and using the rules of alternative (forbidden rules). 
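Before turning to that example, the greedy extraction of essential values described in Sect. 5.1 (the diagnostic rule of the second type applied to values) can be sketched as follows; obj is the derivation operator defined earlier, and the function names are ours.

```python
def is_test_for_values(values, pos, neg):
    """(obj(B), B) is a test for G+: some positive object description
    contains B and no negative one does (Definition 2)."""
    return bool(obj(values, pos)) and not obj(values, neg)

def essential_values(attributes, pos, neg):
    """Greedily grow sbmax(M) while (obj(sbmax), sbmax) is still not a
    test; the complement LEV = M \\ sbmax(M) is a quasi-minimal set of
    essential values, whose size bounds the number of subtasks."""
    sbmax = set()
    for m in attributes:
        if not is_test_for_values(sbmax | {m}, pos, neg):
            sbmax.add(m)
    return set(attributes) - sbmax
```

On the data of Tab. 2 and Tab. 3, such a pass is expected to yield a set comparable to the LEV reported above; the exact result depends on the traversal order, which is why the set is only quasi-minimal.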
Context-Dependent Reasoning Based on Good Diagnostic Tests 75 Table 5. The set SPLUS1 splus(m), m ∈ M splus(m), m ∈ M splus(ma ) → {2, 8, 10} splus(m22 ) → {2, 7, 8, 9, 11} splus(m13 ) → {3, 8, 10} splus(m23 ) → {1, 2, 5, 12, 13} splus(m16 ) → {4, 9, 12} splus(m3 ) → {3, 7, 8, 10, 11, 12} splus(m1 ) → {1, 4, 11, 13} splus(m4 ) → {2, 3, 4, 7, 10, 13} splus(m5 ) → {1, 4, 7, 10} splus(m6 ) → {1, 4, 5, 7, 8, 10} splus(m7 ) → {2, 3, 4, 6, 8, 11} splus(m18 ) → {3, 9, 10, 13} splus(m24 ) → {1, 2, 3, 4, 5, 7, 12} splus(m2 ) → {1, 5, 10, 11, 12} splus(m20 ) → {4, 6, 7, 8, 9, 10, 11, 12} splus(mb ) → {2, 3, 4, 7, 8} splus(m21 ) → {1, 4, 6, 8, 9, 10, 11, 12} splus(m19 ) → {3, 8, 9, 11, 13} splus(m26 ) → {1, 2, 3, 4, 6, 7, 9, 10, 11, 12, 13} Form all the pairs of objects (i, j) ⊆ G+ ; form a set FR of forbidden pairs of objects, FR = {(i, j) : to be test((i, j), val(i, j)) is f alse} (Tab. 6); then the set of admissible pairs is ADM = {(i, j) : to be test((i, j), val(i, j)) is true} (Tab. 6). Each forbidden pair (i, j) generates rule of alternative: j OR i → true(always); i, j → f alse. This rule says that i and j cannot simultaneously enter in any ex- tent of GMRTs, either i or j but not both. These rules of alternatives allow splitting any splus(m), m ∈ M into subsets not containing any forbidden pair of objects. An example is in Tab. 7. As a result of splitting, we shall obtain the splitting tree, leaves of which are subsets of objects not containing any prohibited object pairs. Subcontexts for GMRTs inferring are leaves of the tree with the number of objects more or equal to 4. So, in Tab. 7, we have only one subcontext (4,7,8,11) for searching for tests. As far as triplets of objects, if a triplet is not the extent of a test, then we have all information about pairs of objects contained in it. An analysis of set ADM leads to using some inductive reasoning rules [19]. If element j enters in sq+1 , then it must enter in q proper subsets of sq+1 , where sq+1 is a set containing (q + 1) elements. If we observe that j enters in only one doublet (pair), then it cannot enter in any triplet. If we observe that j enters in only one triplet, then it cannot enter in any quadruplet and so on. If an element enters in two and only two doublets, it means that it will enter only in one triplet. If an element enters in three and only three doublets, it can enter in only one quadruplet. This inductive reasoning is applicable to constructing triplets from doublets, quadruplets from triplets and so on. In our example: Object 9 enters in one and only one doublet, hence (9,11) cannot be included in any triplet. We conclude that (9,11) is the extent of a GMRT. Object 5 enters in two and only two pairs, hence it is included in only one triplet (1,5,12) already contained in the initial content of STGOOD. Object 5 cannot be included in any quadruplet. Object 6 76 Xenia Naidenova and Vladimir Parkhomenko Table 6. 
Classification of object pairs g Admissible pairs of objects Forbidden pairs of objects (1,3) (1,6) (1,8) (1,9) (1,10) 1 (1,2) (1,4) (1,5) (1,7) (1,12) (1,11) (1,13) 2 (2,1) (2,3) (2,4) (2,7) (2,8) (2,10) (2,12) (2,5) (2,6) (2,9) (2,11) (2,13) 3 (3,2) (3,4) (3,7) (3,8) (3,10) (3,11) (3,12) (3,1) (3,5) (3,6) (3,9) (3,13) 4 (4,1) (4,2) (4,3) (4, 6) (4,7) (4,8) (4,11) (4,12) (4,5) (4,9) (4,10) (4,13) (5,2) (5,3) (5,4) (5,6) (5,7) 5 (5,1) (5,12) (5,8) (5,9) (5,10) (5,11) (5,13) (6,1) (6,2) (6,3) (6,5) (6,7) 6 (6,4) (6,8) (6,11) (6,9) (6,10) (6,12) (6,13) 7 (7,1) (7,2) (7,3) (7,4) (7,8) (7,11) (7,12) (7,5) (7, 6) (7,9) (7,10) (7,13) 8 (8,2)(8,3)(8,4) (8,6)(8,7) (8,10) (8,11) (8,1) (8,5) (8,9) (8,12) (8,13) 9 (9,11) All the other pairs (10,1) (10,4) (5,10) (10,6) (10,7) 10 (10,2) (10,3) (10,8) (10,9) (10,11) (10,12) (10,13) (11,1) (11,2) (5,11) (11,9) 11 (11,3) (11,4) (6,11) (7,11) (11,9) (11, 10) (11, 12) (11,13) (12,6) (12,8) (12,9) (12,10) 12 (12,1) (12,2) (12,3) (12,4) (12,5) (12,7) (12,11) (12,13) 13 ∅ All the pairs Table 7. An example of splitting splus(m2 ) No splus(m2 ) Forbidden pair 1 (4,7,8,11,12) (8,12) 2 (4,7,8,11), (4,7,11,12) 3 (4,7,11,12) (11,12) 4 (4,7,11) (4,7,12) Context-Dependent Reasoning Based on Good Diagnostic Tests 77 enters in three and only three pairs, hence it is included in only one quadruplet (4,6,8,11), already contained in STGOOD. Object 13 is not included in any admissible pair of objects. We put {9, 11} and {13} into STGOOD and delete objects 9, 5, 6, and 13 from consideration. This action implies deleting values m16 , m18 , and m23 (see, please Tab. 8), because splus(mi ), where i take values 16, 18 or 23, are absorbed by some elements of STGOOD. Table 8. The set SPLUS2 splus(m), m ∈ M splus(m), m ∈ M splus(ma ) → {2, 8, 10} splus(m22 ) → {2, 7, 8, 11} splus(m13 ) → {3, 8, 10} splus(m23 ) → {1, 2, 12} splus(m16 ) → {4, 12} splus(m3 ) → {3, 7, 8, 10, 11, 12} splus(m1 ) → {1, 4, 11} splus(m4 ) → {2, 3, 4, 7, 10, 13} splus(m5 ) → {1, 4, 7, 10} splus(m6 ) → {1, 4, 7, 8, 10} splus(m7 ) → {2, 3, 4, 6, 8, 11} splus(m18 ) → {3, 10} splus(m24 ) → {1, 2, 3, 4, 7, 12} splus(m2 ) → {1, 10, 11, 12} splus(m20 ) → {4, 7, 8, 10, 11, 12} splus(mb ) → {2, 3, 4, 7, 8} splus(m21 ) → {1, 4, 8, 9, 10, 11, 12} splus(m19 ) → {3, 8, 11} splus(m26 ) → {1, 2, 3, 4, 7, 10, 11, 12} We analyze triplets in SPLUS: splus(m∗ ) = {2, 8, 10}, splus(m1 ) = {1, 4, 11}, splus(m19 ) = {3, 8, 11}, and splus(m13 ) = {3, 8, 10}. As a result, we obtain the extents of two new GMRTs {3, 11} and {8, 10}. After that, we can delete object 10. Then we can delete values m5 , m2 , and m4 after analyzing their modified splus. Now we split splus(m22 ), splus(m20 ), splus(m21 ), splus(m24 ), and splus(m26 ), because these values are the only remaining essential values in the set LEV. Tab. 9 shows the number of subtasks in splitting trees for these essential values and new obtained tests. Table 9. Splitting splus(m), m is essential value splus(m) The number of subtasks New tests Deleted elements splus(m22 ) {7, 8, 11} m26 splus(m20 ) 1 m20 splus(m21 ) m21 , t8 m24 , t1 , t12 , t7 ; remaining set splus(m24 ) 2 of indices {2, 3, 4} is absorbed by element {2, 3, 4, 7} in STGOOD 78 Xenia Naidenova and Vladimir Parkhomenko Some words about future investigations. The most important mathematical tasks worth solving are the following: 1. For irredundant splus(m) to determine whether it contains extents of new GMRTs; 2. 
Optimization of selecting forbidden pairs of objects in order not to obtain repetition of subsets in splitting trees. 6 Conclusion A sketch of classification reasoning is given in the paper. The key notions of the reasoning are notions of classification and its good approximations based on good diagnostic tests. The good diagnostic tests have the dual nature: they generate functional and implicative dependencies and, in the same time, sets of objects satisfying these dependencies. The dependencies are logical assertions on the base of which deductive phase of classification reasoning is realized. In the paper, we deal only with GMRTs as value implications. An algorithm for inferring GMRTs from a set of data is given. A decomposition of the main task of inferring GMRTs into subtasks associated with the certain types of sub- contexts is introduced. There are various ways of using the proposed decompositions and extracted initial information about GMRTs, but we confine ourselves to the most impor- tant ideas allowing to obtain not only tests (and corresponding knowledge) but also, simultaneously, contexts inside of which these tests have been inferred. Currently, there is considerable interest in contextual information search and representation for using it to improve web search results or query expansion [3,4,10]. Classification reasoning can serve as a basic tool for constructing and using content-dependent knowledge in different problem domains and applica- tions. References 1. Baixeries, J., Kaytoue, M., Napoli, A.: Computing functional dependencies with pattern structures. In: Szathmary, L., Priss, U. (eds.) Proceedings of The Ninth In- ternational Conference on Concept Lattices and Their Applications. CEUR Work- shop Proceedings, vol. 972, pp. 175–186. CEUR-WS.org (2012) 2. Banerji, R.B.: Theory of problem solving: an approach to artificial intelligence, vol. 17. Elsevier Publishing Company (1969) 3. Bankar, S., Nagpure, R.: Contextual information search based on domain using query expansion. In: International Journal of Emerging Trends and Technology in Computer Science. vol. 3, pp. 148 – 151 (2014) 4. Bouramoul, A., Kholladi, M.K., Doan, B.L.: PRESY: A Context Based Query Reformulation Tool for Information Retrieval on the Web. Journal of Computer Science 6(4), 470–477 (2010) 5. Ferré, S., Ridoux, O.: The use of associative concepts in the incremental building of a logical context. In: Priss, U., Corbett, D., Angelova, G. (eds.) ICCS. Lecture Notes in Computer Science, vol. 2393, pp. 299–313. Springer (2002) Context-Dependent Reasoning Based on Good Diagnostic Tests 79 6. Finn, V.: The synthesis of cognitive procedures and the problem of induction. NTI ser.2, 1-2, 8–44 (1999), (in Russian) 7. Ganter, B., Wille, R.: Formal concept analysis: mathematical foundations. Springer, Berlin (1999) 8. Ganter, B., Kuznetsov, S.O.: Formalizing hypotheses with concepts. In: Proceed- ings of the Linguistic on Conceptual Structures: Logical Linguistic, and Computa- tional Issues. pp. 342–356. Springer-Verlag (2000) 9. Ganter, B., Kuznetsov, S.: Pattern structures and their projections. In: Delugach, H., Stumme, G. (eds.) Conceptual Structures: Broadening the Base, Lecture Notes in Computer Science, vol. 2120, pp. 129–142. Springer Berlin Heidelberg (2001) 10. Klimešová, D., Ocelı́ková, E.: Study on context understanding, knowledge trans- formation and decision support systems. WSEAS Trans. Info. Sci. and App. 7(7), 985–994 (Jul 2010) 11. 
Kuznetsov, S.: Machine learning on the basis of formal concept analysis. Automa- tion and Remote Control 62(10), 1543–1564 (2001) 12. Mill, J.S.: The System of Logic Ratiocinative and Inductive Being a Connected View of the Principles of Evidence, and the Methods of Scientific Investigation. Longmans (1872) 13. Naidenova, K.: Reducing one class of machine learning algorithms to logical oper- ations of plausible reasoning. Journal of Computer and Systems Sciences Interna- tional 48(3), 401–414 (2009) 14. Naidenova, X.A., Plaksin, M.V., Shagalov, V.L.: Inductive Inferring All Good Clas- sification Tests. In: Valkman, J. (ed.) ”Knowledge-Dialog-Solution”, Proceedings of International Conference. vol. 1, pp. 79 – 84. Kiev Institute of Applied Informatics, Kiev, Ukraine (1995) 15. Naidenova, X.: Machine Learning as a diagnostic task. In: Arefiev, I. (ed.) Knowledge-Dialog-Solution, Materials of the Short-Term Scientific Seminar, pp. 26 – 36. State North-West Technical University Press, St Petersburg, Russia (1992) 16. Naidenova, X.: The Data-Knowlege Transformation. In: Solovyev, V. (ed.) Text Processing and Cognitive Technologies, vol. 3, pp. 130–151. Pushchino (1999) 17. Naidenova, X.: Good classification tests as formal concepts. In: Domenach, F., Ignatov, D., Poelmans, J. (eds.) Formal Concept Analysis, Lecture Notes in Com- puter Science, vol. 7278, pp. 211–226. Springer Berlin Heidelberg, Leuven (2012) 18. Naidenova, X., Ermakov, A.: The decomposition of good diagnostic test inferring algorithms. In: Alty, J., Mikulich, L., Zakrevskij, A. (eds.) ”Computer-Aided De- sign of Discrete Devices” (CAD DD2001), Proceedings of the 4-th Inter. Conf., vol. 3, pp. 61 – 68. Minsk (2001) 19. Naidenova, X.: An incremental learning algorithm for inferring logical rules from examples in the framework of the common reasoning process. In: Triantaphyllou, E., Felici, G. (eds.) Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques, Massive Comp., vol. 6, pp. 89–147. Springer (2006) 20. Naidenova, X.A., Parkhomenko, V.: Attributive and object subcontexts in inferring good maximally redundant tests. In: Bertet, K., Rudolph, S. (eds.) Proceedings of the Eleventh International Conference on Concept Lattices and Their Applications, Košice, Slovakia, October 7-10, 2014. CEUR Workshop Proceedings, vol. 1252, pp. 181–193. CEUR-WS.org (2014) 21. Ore, O.: Galois connections. Trans. Amer. Math. Soc 55, 494–513 (1944) 22. Rasiowa, H.: An Algebraic Approach to Non-classical Logics. Studies in logic and the foundations of mathematics, North-Holland Publishing Company (1974) Concept Lattices of RDF Graphs Jens Kötters Abstract. The concept lattice of an RDF graph is defined. The intents are described by graph patterns rather than sets of attributes, a view that is supported by the fact that RDF data is essentially a graph. A sim- ple formalization by triple graphs defines pattern closures as connected components of graph products. The patterns correspond to conjunctive queries, generalization of properties is supported. The relevance of the notion of connectedness is discussed, showing the impact of ontological considerations on the process of concept formation. Finally, definitions are given which allow the construction of pattern closures of concept graphs, which formalize Conceptual Graphs. 
Keywords: RDF Schema, Pattern Concepts, Conceptual Graphs, Nav- igation 1 Introduction The Resource Description Framework (RDF) is an extensible standard for knowl- edge representation which relies on the use of simple sentences, each consisting of a subject, a predicate and an object. In RDF terminology, such a sentence is called a triple; a collection of triples is called an RDF graph, and the entities which occur as subjects, predicates or objects of triples (i.e.,the things being talked about) are called resources. Figure 1 shows an RDF graph in the text- based Turtle notation [5] (namespace declarations are omitted). Figure 2 shows the same RDF graph, using a standard graphical notation (see e.g. [9]). Each triple is represented by an arc. In the abundance of such data, query languages, most notably SPARQL [1], can be used to gain access to the information con- tained. Querying alone is however of limited use to anyone who needs information but does not know specifically what to look for, or how to ask for it, or maybe whether such information can be found at all. In such cases, concept lattices can in principle guide the exploration of the data source, as they support interac- tive and data-aware query modification. In connection with SPARQL, this has already been demonstrated in [7]. The current paper is also motivated by data exploration, but deliberately limits the query language to conjunctive queries. In this case, the entire search space is obtained as a concept lattice in which each intent is described by a single most specific query (MSQ). Conjunctive queries can be represented as graphs, and entailment is described by graph homomorphisms [6]. Figure 3 shows a graph representing a conjunc- tive query – ”Someone who spoke about a sailor (in the subject)” – and its SPARQL translation below, which can always be obtained in a straightforward way. Graph queries are modeled after RDF graphs (Fig.2), so that solutions are c 2015 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Local Proceedings also appeared in ISBN 978-84-606-7410-8, Universidad de Málaga (Dept. Matemática Aplicada), Spain. 82 Jens Kötters ex:Frank ex:loves ex:Mary . ex:Frank rdf:type ex:Sailor . ex:Frank rdf:type ex:Englishman . ex:Tom ex:loves ex:Mary . ex:Tom rdf:type ex:Gentleman . ex:Tom ex:dislikes ex:Frank . ex:Frank ex:dislikes ex:Tom . ex:Tom ex:name "Tom" . ex:Tom ex:said ex:S1 . ex:S1 rdf:subj ex:Tom . ex:S1 rdf:pred ex:name . ex:S1 rdf:obj "Tom" . ex:Frank ex:jested ex:S2. ex:S2 rdf:subj ex:Frank . ex:S2 rdf:pred ex:likes . ex:S2 rdf:obj ex:Lobscouse . ex:Liam ex:loves ex:Aideen . ex:Aideen ex:loves ex:Liam . ex:Rowan ex:loves ex:Aideen . ex:Rowan ex:hates ex:Liam . ex:Aideen ex:said ex:S3 . ex:S3 rdf:subj ex:Rowan . ex:S3 rdf:pred ex:likes . ex:S3 rdf:obj ex:Aideen . Fig. 1. Sample RDF graph in Turtle notation rdf:pred ex:S2 ex:S1 rdf:pred ex:likes ex:Mary rdf:subj ex:name rdf:obj rdf:subj ex:jested ex:loves ex:loves ex:said rdf:obj ex:lobscouse ex:dislikes rdf:type ex:Frank ex:Tom ex:dislikes ex:name ex:Sailor rdf:type rdf:type "Tom" ex:Englishman ex:Gentleman ex:Rowan rdf:subj ex:hates ex:S3 ex:loves ex:said rdf:pred ex:loves ex:Liam ex:Aideen rdf:obj ex:likes ex:loves Fig. 2. Drawing of the RDF graph of Fig.1 Concept Lattices of RDF Graphs 83 described by homomorphisms, as well. But there are some differences to RDF graphs. Firstly, all nodes (and all arcs) represent variables. 
The subject(s) of the query are distinguished in some way (in Fig. 3, a black node represents the single subject), whereas all other variables are implicitly understood as being existen- tially quantified. Finally, the labels are attributes of a formal context K (for the example, see Fig. 4), and thus represent formal concepts of its concept lattice B(K). In practice, the formal context may encode background knowledge from an RDF Schema file like the one in Fig. 5. The query vocabulary is restricted to the attributes of K, the name Frank e.g. may not be used in a query for the given example. ?x ex:said rdf:type rdf:subj ex:Sailor SELECT ?x WHERE { ?x ex:said ?y. ?y rdf:subj ?z. ?z rdf:type ex:Sailor. } Fig. 3. Conjunctive query as a graph (above) and in SPARQL (below) For the rest of the paper, we shall refer to queries as patterns. Sect. 2 intro- duces triple graphs as formalizations of queries without distinguished variables (they correspond to SPARQL WHERE-clauses), and a query in one distinguished variable is then represented by the pair (x, G) consisting of a triple graph G and a vertex x of G. The case of queries in several variables is not treated in this paper, but has been treated in [11] in connection with relational structures. Section 3 shows how the search space for conjunctive queries over an RDF graph is obtained as a concept lattice, where concept intents are patterns rather than sets of attributes. Lattices of pattern concepts were introduced in [8], and although the theory presented there does not make specific assumptions of what a pattern is, it seems clear that graphs were meant to play a prominent role. In fact, the paper already presents an example application where patterns are chemical compounds and compares them by labeled graph homomorphisms. While chemical compounds are described by connected graphs, their supremum in terms of these homo- morphisms is not necessarily connected (ibid., pp. 139-140). Another suggestion made in [8] is to use Conceptual Graphs (CGs) as patterns. If graph patterns are used to describe situations (where several objects combine to a whole by means of the way they are related to each other), we would probably expect them to be connected. But if we formalize CGs using the same kind of labeled graphs that was used for chemical compounds, the set of patterns is generally not closed under suprema because, as we have seen, the supremum operation does not preserve connectedness. 84 Jens Kötters Englishman Gentleman lobscouse Resource dislikes ”Tom” Properties and Classes Person jested Sailor name hates loves Man pred type likes subj said obj name × × type × × loves ×× × likes × × hates ×× × dislikes × × jested ×× × said × × subj × × pred × × obj × × Englishman × × × × Gentleman ×× × × Man × × × Sailor ×× × Person × × ”Tom” × × lobscouse ×× Resource × Fig. 4. Formal context K of properties and classes ex:Englishman rdfs:subClassOf ex:Man ex:Gentleman rdfs:subClassOf ex:Man ex:Sailor rdfs:subClassOf ex:Person ex:Man rdfs:subClassOf ex:Person ex:loves rdfs:subPropertyOf ex:likes ex:hates rdfs:subPropertyOf ex:dislikes ex:jested rdfs:subPropertyOf ex:said ex:loves rdfs:domain ex:Person ex:loves rdfs:range ex:Person Fig. 5. An RDFS ontology Concept Lattices of RDF Graphs 85 The situation changes if graphs have a distinguished element. 
The supremum is a graph product, again with a distinguished element, and although the product of connected graphs is generally not connected, it makes sense to identify the supremum with the connected component which holds the distinguished variable. The relevance of connectedness is discussed in Sect. 4. In Sect. 5, the approach is applied to concept graphs.

2 Triple Graphs

A triple graph over K is a triple (V, T, κ), where V is a set of vertices, T ⊆ V × V × V and κ : V → B(K). A morphism ϕ : (V_1, T_1, κ_1) → (V_2, T_2, κ_2) of triple graphs is a map ϕ : V_1 → V_2 with

(x, y, z) ∈ T_1 ⇒ (ϕx, ϕy, ϕz) ∈ T_2,   (1)
κ_1(x) ≥ κ_2(ϕx)   (2)

for all x, y, z ∈ V_1. The product of a family of triple graphs G_i = (V_i, T_i, κ_i), i ∈ I, is the triple graph

∏_{i∈I} G_i := ( ×_{i∈I} V_i , { t^T | t ∈ ×_{i∈I} T_i } , ⋁_{i∈I} κ_i ),   (3)

where t^T := ((t_{i0})_{i∈I}, (t_{i1})_{i∈I}, (t_{i2})_{i∈I}) and (⋁_{i∈I} κ_i)(v) := ⋁_{i∈I} κ_i(v_i). Finally, an RDF graph with the set T of triples is represented by the triple graph (R, T, γ*), where R is the set of resources (properties and classes are duplicated so that they appear in only one triple) and

γ*(g) := (g′′, g′) if g ∈ G,  and  γ*(g) := ⊤ := (G, G′) otherwise.   (4)

A pattern is a pair (v, H), where H =: (V, T, κ) is a triple graph and v ∈ V. A pattern morphism ϕ : (v_1, H_1) → (v_2, H_2) is a morphism ϕ : H_1 → H_2 with ϕ(v_1) = v_2. A solution is a pattern in the data graph. Pattern morphisms can thus describe solutions as well as entailment. The product of a family of patterns is given by

∏_{i∈I} (v_i, H_i) := ( (v_i)_{i∈I} , ∏_{i∈I} H_i ).   (5)

The definitions above follow a category theoretical approach. Queries (minus distinguished variable(s)) and the data source have representations in the same category (using ∆ for the data source), and morphisms λ : G → ∆ can be understood as solutions of the respective query. Query entailment is then described by morphisms, and if the category has products, then so has the category of structured arrows (these are the pairs (ν, G), cf. [2]), which model queries with distinguished variables. The pattern closures (MSQs) arise as products from the set of possible solutions (λ, ∆). For convenience, we write (x, G) instead of (ν, G) if the domain of ν is a one-element set (x being the image of that element).

3 Example RDF Graph and Concept Lattice

In this section, we will walk through the process of generating a concept lattice for the RDF graph in Fig. 1. To keep the example small (and perhaps meaningful), only person concepts will be created, i.e. concepts with extents in the set {Tom, Frank, Mary, Rowan, Liam, Aideen}.

As a first step, we determine the object intent for each person, i.e. its most complete description. For each person, its connected component in Fig. 2 doubles as a most complete description if we reinterpret the nodes and arcs as (distinct) variables, reinterpret the labels beside them as concept names (rather than resource names), and mark the variable corresponding to that person as distinguished. Let G1 and G2 denote, top to bottom, the components in Fig. 2 after reinterpretation, and let T, F, M, R, L and A denote the variables which replace Tom, Frank, Mary, Rowan, Liam and Aideen. Then the pairs (T, G1), (F, G1), (M, G1), (R, G2), (L, G2) and (A, G2) formalize the object intents. These pairs are considered patterns; each pattern consists of a distinguished variable and a triple graph (formalized in Sect. 2). The person concepts are obtained from graph products of the six base patterns.
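For the binary case, a direct transcription of the product (3) could look like the following sketch; the names are illustrative, labels are elements of B(K), and their join in the concept lattice is supplied by the caller.

```python
from itertools import product as cartesian

def triple_graph_product(g1, g2, join):
    """Product of two triple graphs (V, T, kappa) following Eq. (3):
    vertices are pairs, a triple is formed componentwise from one triple
    of each factor, and labels are joined in the concept lattice B(K)."""
    v1, t1, k1 = g1
    v2, t2, k2 = g2
    vertices = set(cartesian(v1, v2))
    triples = {((a1, a2), (b1, b2), (c1, c2))
               for (a1, b1, c1) in t1
               for (a2, b2, c2) in t2}
    kappa = {(x1, x2): join(k1[x1], k2[x2]) for (x1, x2) in vertices}
    return vertices, triples, kappa
```

For patterns, the pair of distinguished vertices is carried along with the product, and only the component containing it is retained, as sketched further below.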
Many concept intents have the same underlying triple graph (consider e.g. the base patterns), so there would be a lot of redundancy in the concept lattice if we were to draw intents next to each concept. A different kind of diagram avoids this redundancy while still showing all concepts and their intents : Fig. 6 shows all triple graphs which occur in the intents of person concepts, ordered by homomorphism. Choosing any of the nodes as a distinguished variable gives rise to a pattern intent, and this way concepts can be identified with triple graph nodes. The gray nodes in Fig. 6 are the person concepts (the white nodes are non-person concepts which occur in the descriptions of person concepts). The concept extents for the person concepts are drawn next to the gray nodes. The person concept lattice itself can be easily obtained from the diagram in Fig. 6 by connecting the gray nodes according to their extents. Note however that some concepts have multiple representations in Fig. 6. These may occur in the same triple graph, which indicates symmetry (i.e., a triple graph automorphism). An example is the pair of lovers on the right side of Fig. 6. The other case is where the same concept occurs in different triple graphs. An example would be the Mary concept, which occurs in the lower left triple graph as well as in the triple graph above it on the left side of Fig. 6. In such cases, there is always a most specific triple graph containing that concept; it describes that concept’s pattern intent. Other triple graphs containing the same concept can be folded onto the most specific triple graph (which is contained as a subgraph). But the fact that triple graphs may reappear as subgraphs of other triple graphs translates into another redundancy problem when drawing the hierarchy of triple graph product components. In Fig. 6, this kind of redundancy has been avoided by cutting those subgraphs out and indicate the glue nodes (for the gluing operation which inverts the cutting operation) with an asterisk. The glue nodes are only marked in the upper triple graph, not in the lower triple graph, because the lower triple graph can be considered self-contained (a Concept Lattices of RDF Graphs 87 L F RT AM LR A A L FT loves M LR FT loves * MA loves A loves L R L * loves * A R A L F pred T loves said loves *L A subj dislikes * MA F obj M L loves T F T pred pred loves subj M subj likes R * loves A obj loves * dislikes loves obj loves said pred R F loves A L L loves A said said M A L dislikes subj F T L A dislikes F M obj T F loves Man Man L T likes pred pred M subj subj R loves name subj obj loves hates loves jested said pred loves said lobscouse dislikes L type obj loves obj likes dislikes T name A F type type Sailor Gentleman Englishman "Tom" T T Fig. 6. Concepts of the RDF graph 88 Jens Kötters glue node’s counterpart in the lower graph can be identified by its extent). To put this differently: in every triple graph, the asterisk concepts are part of the description of the non-asterisk concepts, but not vice versa. The extent of a concept reveals how its pattern intent can be obtained. Consider for example the concept with extension {R, F, T }, which is part of the top left triple graph. The extension tells us that the pattern intent is the (core graph of the) component of the pattern product (R, G2 )×(F, G1 )×(T, G1 ), which contains the vertex (R, F, T ), and (R, F, T ) becomes the designated variable. 
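As an aside, the restriction of a product to the component that holds the distinguished vertex could be sketched as follows. This is a deliberate simplification: plain weak connectedness over the triple set is used, without the ontological refinements discussed in Sect. 4.

```python
from collections import deque

def component_of(vertex, triples):
    """Vertices weakly connected to `vertex` through shared triples."""
    adj = {}
    for t in triples:                      # vertices in one triple are adjacent
        for x in t:
            adj.setdefault(x, set()).update(set(t) - {x})
    seen, queue = {vertex}, deque([vertex])
    while queue:
        x = queue.popleft()
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return seen

def pattern_closure(distinguished, graph):
    """Restrict a product graph to the component of the distinguished vertex."""
    vertices, triples, kappa = graph
    comp = component_of(distinguished, triples)
    return (distinguished,
            (comp,
             {t for t in triples if set(t) <= comp},
             {x: kappa[x] for x in comp}))
```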
The diagram also shows that {R, T } is a generating set for this extent, because it is not contained in a concept extent of the triple graph’s lower neighbors, and so (R, G2 ) × (T, G1 ) already produces the same pattern. As was described in [8], Ganter’s NextConcept algorithm can be used to systematically generate all extents (and thus all concepts), computing pattern products in the process. It is advisable to use graph folding to obtain minimal equivalent patterns, for output as well as inbetween computational steps. Product patterns could also be computed directly from the triples in Turtle notation (Fig. 1), by combining triples with each other. Informally, we could write triple graphs as sets of triples (x : κ(x), p : κ(p), y : κ(y)) (borrowing from RDF and CG notation), and compute the product triples by taking suprema componentwise, like so: (F : >, p : type, y1 : Englishman) ∨ (T : >, p : type, y2 : Gentleman) = ([ FT ] : >, [ pp ] : type, [ yy12 ] : Man). However, one would still have to identify and minimize connected components. Note that the property and class hierarchies in our example are trees. So the attribute concepts of K are closed under suprema, and each concept label can be expressed by a single attribute. The suprema of properties may be of type ”Resource”. Such arcs are removed from the patterns; it can be shown (using the product property) that the resulting patterns can still be considered MSQs. 4 Connectedness We have identified two components of the RDF graph in Fig. 1, and this cor- responds to our understanding that objects are ”connected” if they contribute to the same instance of a situation. The notion of connectedness deserves closer examination, however. Let us first observe that the notion of strong connectivity is not the right one, because the second to top concepts in Fig. 6 (we might name them ”Lover” and ”Beloved”) would not have been created under this notion. But also, we can in general not rely on weak connectivity alone. If it had been stated explicitly that all persons are of type ”Person”, all persons would be connected through that class node. Clearly, objects do not contribute to the same situation just on the basis of belonging to the same class, or having equal property values, like being of the Concept Lattices of RDF Graphs 89 same age. So connectedness of patterns should not rely on properties, classes or values. In this paper, no further requirements have been made, but if Liam liked lobscouse (a traditional sailors’ dish), this would have not been sufficient because all objects would be connected through the lobscouse node. From a machine perspective, there is no indication in the RDF data (or the schema) that ”ex:lobscouse” is of a different quality than ”ex:Tom” or ”ex:Mary”. A sys- tem that generates patterns from automatically acquired data without proper preprocessing would possibly generate a large number of meaningless (and un- necessarily large) concepts because of such insubstantial connections, unless it can be guaranteed that some standard for making such distinctions is applied on the level of the ontology language. 
Further questions may include if and how real-world objects should be con- nected through objects of spoken language; on what basis we could allow a pattern like the one describing the set of all wifes older than their husband, and at the same time keep meaningless comparison of values from having an impact on concept formation; or if pattern connection should be described in terms of edges rather than nodes, and if we can classify or characterize the kind of relations which contribute to pattern formation. It seems that, when dealing with pattern concepts, questions of philosophical ontology may have (at least subjectively) measurable impact through the notion of connectedness. 5 Related Work In Sect. 2, the category theoretical framework for lattices of pattern concepts has been applied to triple graphs, and we will now show how it is applied to simple concept graphs. In [15], a simple concept graph over a power context family ~ = (K0 , K1 , K2 , . . . ) is defined as a 5-tuple (V, E, ν, κ, ρ). We may interpret K the 4-tuple (V, E, ν, κ) as a query, K ~ as a data source and the map ρ as a set of solutions. The map κ assigns concepts of the contexts Ki to the vertices and edges in V and E, respectively. The definition of queries should be independent from any particular data source, so it makes sense to interpret K ~ as a power context family describing schema information (background knowledge) instead, the contexts could describe attribute implications, like the one in Fig. 4. The map ρ will however not be part of the query. For the rest of the exposition, we shall call the 4-tuple (V, E, ν, κ) an abstract concept graph over K, ~ after a term used in the first paper on concept graphs [14, p.300]. In [15], the triple (V, E, ν) is called a relational graph. The morphisms in the category of abstract concept graphs over K ~ are relational graph morphisms (which replaces condition (1) for triple graphs) satisfying (2). The product is de- fined in analogy to (3) (cartesian products of edges have to be taken individually for each arity k). A datasource for the schema K ~ is a power context family D ~ = (D0 , D1 , D2 , . . . ) for which the attribute implications described by Ki =: (Hi , Ni , Ji ) are satisfied in Di =: (Gi , Mi , Ii ), i = 0, 1, 2, . . . , andSNi ⊆ Mi holds. We can represent D ~ by an abstract concept graph D = (G0 , i≥1 Gi , id, γD ) with γD (u) := ((uIk ∩ 90 Jens Kötters Nk )Jk , uIk ∩ Nk ) ∈ B(Kk ) for u ∈ Gk . Let us furthermore define ExtD (κ(u)) := Ext(κ(u))Jk Ik for u ∈ V ∪ E. If we define a realization of G =: (V, E, κ, ν) over D as a map with ρ(u) ∈ Ext∆ (κ(u)) for all u ∈ V ∪E and ρ(e) = (ρ(v1 ), . . . , ρ(vk )), we obtain that the realizations ρ of G over D are precisely the morphisms ρ : G → D. This means that product patterns for abstract concept graphs can be interpreted as MSQs, as was mentioned at the end of Sect. 1. Triple graphs are now obtained as a special case of concept graphs: We define a power context family D ~ where D0 is a supercontext of K0 which in addition contains all objects (having only the ”Resource” attribute), and where D3 = (T, ∅, ∅) represents the set of triples. In Relational Concept Analysis, concept lattices for different kinds of inter- related objects are generated by an iterative algorithm and, as in Fig. 6, illus- trations displaying graphs of concepts have been given [4, 10]. Computational aspects of generating pattern structures are covered e.g. in [13, 12]. 
In [3], con- cept lattices are generated from Conceptual Graphs, but the approach seems to be tailored towards a more specific case where edges (and thus paths) express dependencies. A comparison with Relational Semantic Systems [16] still has to be made. 6 Conclusion In [11], a lattice of closures of database queries, which links products of query patterns to the join operation on result tables, has been introduced. The queries and database were formalized by relational structures. The fact that the method could be applied again, first to RDF graphs and then to abstract concept graphs, suggests that the underlying category theoretical notions provide a recipe that could be applied to still different formalizations of queries, allowing in every case the mathematical description of lattices of most specific queries. Combinatorial explosion is a computational problem that quickly turns up when computing the concept lattices. From a practical perspective, the lattices are useless if complex- ity problems can not be solved. However, in order to support data exploration, patterns must be understandable, thus also limited in size, and a solution to this problem may entail a solution to the problem of computational complexity. In the section on connectedness, we have seen that ontological considerations stand in a direct relation to the quality of computed concepts. As a first con- sequence, although the idea of treating all kinds of resources in a homogeneous way seemed appealing, triple graphs must be replaced by some other formal- ization which reflects the importance of the distinction between instances and classes/properties. While such questions naturally arise in the setting of pattern concepts, they seem to have no obvious analogy for standard FCA, where intents are sets of attributes. Pattern concepts have thus the potential to further and substantiate a perspective on FCA as a branch of mathematical modeling, where the entities to be modeled are not ”real world” systems but rather systems of concepts. Future work may be concerned with extensions of RDF/RDFS which support this perspective. Concept Lattices of RDF Graphs 91 References 1. SPARQL 1.1 Query Language. Tech. rep., W3C (2013), http://www.w3.org/TR/ sparql11-query 2. Adámek, J., Herrlich, H., Strecker, G.E.: Abstract and concrete categories : the joy of cats. Pure and applied mathematics, Wiley, New York (1990) 3. Andrews, S., Polovina, S.: A mapping from conceptual graphs to formal concept analysis. In: Andrews, S., Polovina, S., Hill, R., Akhgar, B. (eds.) Proceedings of ICCS 2011. LNCS, vol. 6828, pp. 63 – 76. Springer (2011) 4. Azmeh, Z., Huchard, M., Napoli, A., Hacene, M.R., Valtchev, P.: Querying rela- tional concept lattices. In: Proc. of the 8th Intl. Conf. on Concept Lattices and their Applications (CLA’11). pp. 377–392 (2011) 5. Beckett, D., Berners-Lee, T., Prud’hommeaux, E., Carothers, G.: RDF 1.1 Turtle. Tech. rep., W3C (2014), http://www.w3.org/TR/turtle 6. Chandra, A.K., Merlin, P.M.: Optimal implementation of conjunctive queries in relational databases. In: Proceedings of the ninth annual ACM symposium on Theory of computing. pp. 77–90. STOC ’77, ACM, New York, NY, USA (1977) 7. Ferré, S.: Conceptual navigation in RDF graphs with SPARQL-like queries. In: Kwuida, L., Sertkaya, B. (eds.) Proceedings of ICFCA 2010. LNCS, vol. 5986, pp. 193–208. Springer, Heidelberg (2010) 8. Ganter, B., Kuznetsov, S.O.: Pattern structures and their projections. In: Delu- gach, H.S., Stumme, G. (eds.) Proceedings of ICCS 2001. 
LNCS, vol. 2120, pp. 129–142. Springer (2001) 9. Hitzler, P., Krötzsch, M., Rudolph, S.: Foundations of Semantic Web Technologies. Chapman & Hall/CRC (2009) 10. Huchard, M., Hacene, M.R., Roume, C., Valtchev, P.: Relational concept discovery in structured datasets. Annals of Mathematics and Artificial Intelligence 49(1-4), 39–76 (2007) 11. Kötters, J.: Concept lattices of a relational structure. In: Pfeiffer, H.D., Ignatov, D.I., Poelmans, J., Gadiraju, N. (eds.) Proceedings of ICCS 2013. LNCS, vol. 7735, pp. 301–310. Springer (2013) 12. Kuznetsov, S.O.: Computing graph-based lattices from smallest projections. In: Wolff, K.E., Palchunov, D.E., Zagoruiko, N.G., Andelfinger, U. (eds.) KONT/KPP. Lecture Notes in Computer Science, vol. 6581, pp. 35–47. Springer (2007) 13. Kuznetsov, S.O.: Fitting pattern structures to knowledge discovery in big data. In: Formal Concept Analysis, 11th International Conference, ICFCA 2013, Dresden, Germany, May 21-24, 2013. Proceedings. pp. 254–266 (2013), http://dx.doi.org/ 10.1007/978-3-642-38317-5_17 14. Wille, R.: Conceptual graphs and formal concept analysis. In: Lukose, D., Delu- gach, H.S., Keeler, M., Searle, L., Sowa, J.F. (eds.) Proceedings of ICCS 1997, 5th International Conference on Conceptual Structures. LNCS, vol. 1257, pp. 290–303. Springer, Heidelberg (1997) 15. Wille, R.: Formal concept analysis and contextual logic. In: Hitzler, P., Schärfe, H. (eds.) Conceptual Structures in Practice, pp. 137–173. Chapman & Hall/CRC (2009) 16. Wolff, K.E.: Relational scaling in relational semantic systems. In: Rudolph, S., Dau, F., Kuznetsov, S.O. (eds.) Proceedings of ICCS 2009. LNCS, vol. 5662, pp. 307–320. Springer (2009) Variability representation in product lines using concept lattices: feasibility study with descriptions from Wikipedia’s product comparison matrices Jessie Carbonnel, Marianne Huchard, and Alain Gutierrez LIRMM, CNRS & Université de Montpellier, France jessiecarbonnel@gmail.com surname.lastname@lirmm.fr Abstract. Several formalisms can be used to express variability in a product line. Product comparison matrix is a common and simple way to display variability of existing products from a same family, but they lack of formalisation. In this paper, we focus on concept lattices, another alternative already explored in several works to express variability. We first propose a method to translate a description from existing product comparison matrices into a concept lattice using Formal Concept Analy- sis. Then, we propose an approach to represent the case where a product family is described by other product families with interconnected lat- tices using Relational Concept Analysis. Because of the combinatorial aspect of these approaches, we evaluate the scalability of the produced structures. We show that a particular structure (AOC-poset) possesses interesting properties for the studies that we envision. Keywords: Product lines, Formal Concept Analysis, Variability Rep- resentation. 1 Introduction In product line engineering [5], several formalisms can be used to depict vari- ability. Variability representation usually requires to take into account a large amount of data, and it is important to provide tools to help designers to represent and exploit them. Among existing formalisms, the most common is Product Comparison Matri- ces (PCMs). PCMs describe product properties in a tabular form. It is a simple way to display features of products from a same family and to compare them. 
However, there is no existing format or set of good practices for designing these PCMs. Therefore, cells in PCMs lack formalisation and it is difficult to perform automatic and efficient processing or analysis on them [4, 17]. Feature Models (FMs) constitute an alternative to PCMs. FMs describe a set of existing features and constraints between them, and thus represent all possible configurations of products from a same family [6, 9, 11, 12]. They depict variability in a more formal way than PCMs, but, in their standard form, FMs focus exclusively on features and do not specify whether an existing product is associated with a configuration.

Formal Concept Analysis (FCA) and concept lattices have already been studied to express variability [13]. From a set of objects described by attributes, FCA computes subsets of objects that have common attributes and structures these subsets in a hierarchical way in concept lattices. Concept lattices can represent information related to variability in a more formal way than PCMs. Through their structure, concept lattices highlight constraints between attributes like FMs, while keeping the relation between existing products of the family and the possible configurations. Besides, unlike FMs, which can take many different forms depending on design choices, concept lattices represent variability in a canonical form. Moreover, FCA can be extended by Relational Concept Analysis (RCA), which takes into account relationships between objects of separate lattices and provides a set of interconnected lattices. This is useful to classify sets of products from different categories.

The formal and structural properties of concept lattices make them good candidates for automatic or manual processing, including product comparison, search by attributes, partial visualisation around points of interest, or decision support. But the exponential growth of the size of concept lattices can make them difficult to exploit. In this paper, we study the dimensions of these structures and try to find out whether depicting product line variability with concept lattices yields structures that can be exploited from the perspective of size. To this end, we first build concept lattices from existing descriptions. Because they are abundant and focus on both products and features, we extract descriptions from PCMs. Besides, some PCMs possess a feature whose value domain corresponds to a set of products described by another PCM. This is a special case where a product family is described by another product family. In a second phase, we model this case by interconnecting the concept lattices obtained from the descriptions of these two PCMs using Relational Concept Analysis.

In this paper, we want to answer these questions: How can we represent the variability expressed by a PCM with a concept lattice? To what extent can we model the case where a product family is described by another product family with RCA? Can we efficiently exploit structures obtained with FCA and RCA with regard to their size?

The remainder of this paper is organised as follows.
In Section 2, we will re- view Formal Concept Analysis and propose a way to build a concept lattice from a PCM’s description. In Section 3, we will study how to represent the case where a product family is described by another product family with Relational Concept Analysis. Then, in Section 4, we will apply these two methods on Wikipedia’s PCMs to get an order of magnitude of the obtained structure size. Section 5 discusses related work. Section 6 presents conclusion and future work. Variability representation in product lines using concept lattices 95 2 Formal Concept Analysis and product comparison matrices This section proposes an approach to represent with a concept lattice the vari- ability originally expressed in a PCM. 2.1 Concept lattices and AOC-posets Formal Concept Analysis [8] is a mathematical framework which structures a set of objects described by attributes and highlights the differences and similarities between these objects. FCA extracts from a formal context a set of concepts that forms a partial order provided with a lattice structure: the concept lattice. A formal context is a 3-tuple (O, A, R) where O and A are two sets and R ⊆ O × A a binary relation. Elements from O are called objects and elements from A are called attributes. A pair from R states that the object o possesses the attribute a. Given a formal context K = (O, A, R), a concept is a tuple (E, I) such that E ⊆ O and I ⊆ A. It depicts a maximal set of objects that share a maximal set of common attributes. E = {o ∈ O|∀a ∈ I, (o, a) ∈ R} is the concept’s extent and I = {a ∈ A|∀o ∈ E, (o, a) ∈ R} is the concept’s intent. Given a formal context K = (O, A, R) and two concepts C1 = (E1 , I1 ) and C2 = (E2 , I2 ) from K, C1 ≤ C2 if and only if E1 ⊆ E2 and I2 ⊆ I1 . C1 is a subconcept of C2 and C2 is a superconcept of C1 . When we provide all the concepts from K with the specialisation order ≤, we obtain a lattice structure called a concept lattice. We represent intent and extent of a concept in an optimised way by making el- ements appear only in the concept where they are introduced. Figure 7 represents a concept lattice having simplified intent and extent. We call object-concept and attribute-concept the concepts which introduce respectively at least an object or an attribute. In Figure 7, Concept 7 and Concept 2 introduce neither attributes nor objects: their simplified intent and extent are empty. If they are not nec- essary, we can choose to ignore these concepts. Attribute-Object-Concept poset (AOC-poset) from a formal context K is the sub-order of (CK ,≤) restricted to object-concepts and attribute-concepts. Figure 4 presents the AOC-poset match- ing the concept lattice from Figure 7. In our case, interesting properties of con- cept lattices with regard to variability are preserved in AOC-posets: this smaller structure can be used as an alternative to concept lattices. 2.2 Product Comparison Matrix A PCM describes a set of products from a same family with variable features in a tabular way. Figure 1 presents a PCM which describes four products against two features, taken from Wikipedia. We notice that the cells of this PCM lack of formalisation: in this, different values have the same meaning (Object-Oriented and OO) and it seems that there are no rules on the use of value separator (’,’ or ’&’ or ’and’). 
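As an aside to the definitions of Sect. 2.1, a naive enumeration of the concepts of a small formal context can be sketched in a few lines of Python. This is for illustration only: it enumerates attribute subsets and is therefore exponential, which already hints at the scalability question studied later.

```python
from itertools import combinations

def concepts(context):
    """context: dict object -> set of attributes. Returns all (extent, intent)
    pairs, i.e. maximal sets of objects sharing maximal sets of attributes."""
    attributes = set().union(*context.values()) if context else set()
    found = set()
    for k in range(len(attributes) + 1):
        for subset in combinations(sorted(attributes), k):
            extent = frozenset(o for o, a in context.items()
                               if set(subset) <= a)
            # intent = closure of the subset: attributes shared by the extent
            intent = (frozenset.intersection(*(frozenset(context[o])
                                               for o in extent))
                      if extent else frozenset(attributes))
            found.add((extent, intent))
    return found
```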
96 Jessie Carbonnel, Marianne Huchard, and Alain Gutierrez Language Standardized Paradigm Java Yes Functional, Imperative, OO Perl No Functional & Procedural & Imperative Php No Functional, Procedural, Imperative and Object-Oriented C# Yes functional, procedural, imperative, Object Oriented Fig. 1. Excerpt of a Product Comparison Matrix on programming languages (http://en.wikipedia.org/wiki/Comparison of programming languages, July 2014) 2.3 Processing PCMs to get formal context If we want to automatically build concept lattices from this kind of descriptions, we need to make the cell values respect a format in order to extract and process them. In [17], authors identify height different types of cell values: – yes/no values that indicate if the criterion is satisfied or not, – constrained yes/no values when the criterion is satisfied under conditions, – single-value when the criterion is satisfied using this value, – multi-values when several values can satisfy the criterion, – unknown value when we do not know if the criterion is satisfied, – empty cell, – inconsistent value when the value is not related to the criterion, – extra information when the cell value offers additional informations. We clean each type of cells as follows. Empty cells are not a problem, even though they indicate a lack of information. Inconsistent values should be de- tected, then clarified or erased. Unknown values and extra informations will be simply erased. Other types of cells could have either one single value or a list of values. We will always use a coma as value separator. Values with the same meanings will be written in the same way. Once we have applied these rules on a PCM, we consider that this PCM is cleaned. A cleaned PCM can own three types of features: – simple boolean, – constrained boolean, – non-boolean. Since automatic process is difficult on original PCMs, we clean them manu- ally. When we clean the PCM in Figure 1, we obtain the PCM in Figure 2. We can now automatically extract values from cleaned PCMs to generate formal contexts. Given a set of objects and a set of binary attributes, a formal context is a binary relation that states which attributes are possessed by each object. In summary, we want to convert a set of multivalued features (PCM) into a set of binary attributes (formal context). Scaling technique [8] consists in creating a binary attribute for each value (or group of values) of a multivalued feature. Boolean features can produce a single attribute to indicate if either or not the object owns this feature. Yet Variability representation in product lines using concept lattices 97 Language Standardized Paradigm Java Yes Functional, Imperative, Object-Oriented Perl No Functional, Procedural, Imperative Php No Functional, Procedural, Imperative, Object-Oriented C# Yes Functional, Procedural, Imperative, Object-Oriented Fig. 2. PCM in Figure 1 cleaned manually because of empty cells, the fact that an object does not possess an attribute could also mean that the cell from the PCM was left blank. To be more precise, we can choose to generate two attributes: one to indicate that the object possesses the feature, and one to indicate that the object does not possess the feature. Constrained boolean features can be processed in the same way than simple boolean features by producing one or two attributes, with the difference that we can keep constraints in the form of attributes. Non-boolean features will simply produce an attribute per value or group of values. 
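A minimal sketch of the scaling step, turning a cleaned PCM into a binary formal context, is given below. It anticipates the two-attribute encoding of Boolean features discussed next; function and data names are illustrative.

```python
def scale_pcm(pcm, boolean_features, two_valued=True):
    """Scaling: one binary attribute per value of each multi-valued feature;
    Boolean features optionally yield a ':Yes' or ':No' attribute."""
    context = {}                          # product -> set of binary attributes
    for product, features in pcm.items():
        attrs = set()
        for feature, values in features.items():
            if feature in boolean_features:
                if two_valued:
                    attrs.add(f"{feature}:{'Yes' if values else 'No'}")
                elif values:
                    attrs.add(feature)
            else:
                attrs.update(values)      # one attribute per listed value
        context[product] = attrs
    return context

# e.g. two rows of the cleaned PCM of Fig. 2:
pcm = {
    "Java": {"Standardized": True,
             "Paradigm": {"Functional", "Imperative", "Object-Oriented"}},
    "Perl": {"Standardized": False,
             "Paradigm": {"Functional", "Procedural", "Imperative"}},
}
languages = scale_pcm(pcm, boolean_features={"Standardized"})
```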
We applied the scaling technique to the cleaned PCM of Figure 2 and obtained the formal context in Figure 3.

Language | Standardized:Yes | Standardized:No | Functional | Procedural | Imperative | Object-Oriented
Java     | x                |                 | x          |            | x          | x
Perl     |                  | x               | x          | x          | x          |
Php      |                  | x               | x          | x          | x          | x
C#       | x                |                 | x          | x          | x          | x

Fig. 3. Formal context obtained after scaling the cleaned PCM in Figure 2

The structure of concept lattices and AOC-posets highlights interesting properties from a variability point of view: they classify objects depending on their attributes and emphasise relations between these attributes (e.g. require, exclude). For instance: attributes introduced in the top concept are owned by all objects; attributes introduced in the same concept always appear together; if an object o1 is introduced in a sub-concept of a concept introducing an object o2, then o1 possesses all the attributes of o2 plus additional ones; two objects introduced in the same concept possess the same attributes. Feature models show part of this information, mainly reduced to relations between attributes, as they do not include the products in the representation.

Figure 7 represents the concept lattice and Figure 4 the AOC-poset built from the formal context of Figure 3. In these two structures, we can see that: all languages allow writing programs according to the functional and imperative paradigms; the product Php has all the attributes of Perl in addition to the attribute Object-Oriented; all standardised languages are object-oriented.

Fig. 4. AOC-poset from the formal context of Figure 3

3 Relational Concept Analysis and interconnected product lattices

This section proposes an approach to model the case where a PCM possesses a feature whose value domain corresponds to a set of products described by another PCM.

3.1 Modeling interconnected families of products with RCA

We illustrate the modeling of interconnected families of products with an extension of our example. We can find on Wikipedia a PCM on wikisoftwares that refers to programming languages: we want to structure wikisoftwares according to the programming languages in which they are written. We assume a PCM about wikisoftwares that owns a boolean feature Open Source and a constrained boolean feature Spam Prevention. We applied the approach of Section 2 and obtained the formal (objects-attributes) context in Figure 5. Figure 8 presents the concept lattice associated with the context in Figure 5.

Fig. 5. Objects-attributes context of Wikisoftwares (objects: TeamPage, Socialtext, MindTouch, DokuWiki, EditMe; attributes: OS:Yes, OS:No, SP:Yes, SP:No, SP:Captcha, SP:Blacklist)

Relational Concept Analysis (RCA) [10] extends FCA to take into account relations between several sets of objects. Each set of objects is defined by its own attributes (in an objects-attributes context) and can be linked with other sets of objects. A relation between two sets of objects is stated by a relational context (objects-objects context). A relational context is a 3-tuple (O1, O2, I) where O1 (source) and O2 (target) are two sets of objects such that there are two formal contexts (O1, A1, R1) and (O2, A2, R2), and where I ⊆ O1 × O2 is a binary relation. We want to express the relation isWritten between the objects of Wikisoftwares and the objects of Programming languages.
TeamPage and EditMe are written in Java, Socialtext in Perl, DokuWiki in Php and MindTouch in both Php and C#. We link each wikisoftware with the corresponding programming languages in an objects-objects context, presented in Figure 6.

isWritten  | Java | Perl | Php | C#
TeamPage   | x    |      |     |
Socialtext |      | x    |     |
MindTouch  |      |      | x   | x
DokuWiki   |      |      | x   |
EditMe     | x    |      |     |

Fig. 6. Objects-objects context expressing the relation between objects of Wikisoftwares and Programming languages

3.2 Processing interconnected families of products

Given an objects-objects context Rj = (Ok, Ol, Ij), there are different ways for an object of Ok to be in relation with a concept built on Ol. For instance: an object is linked (by Ij) to at least one object of the concept's extent (existential scaling); an object is linked (by Ij) only to objects of the concept's extent (universal scaling). For each relation of R, we specify which scaling operator is used.

Fig. 7. Concept lattice of Programming languages, step 0 (built with RCAExplore: http://dolques.free.fr/rcaexplore.php)

In RCA, objects-attributes contexts are extended according to objects-objects contexts to take into account relations between objects of different sets. For each objects-objects context Rj = (Ok, Ol, Ij), RCA extends the objects-attributes context of the set of objects Ok by adding relational attributes built from the concepts of the lattice associated with the objects-attributes context of Ol. Each concept c built on Ol gives a relational attribute q r:c, where q is a scaling operator and r is the relation between Ok and Ol. A relational attribute appears in a lattice just like the other attributes, with the difference that it can be seen as a reference to a concept of another lattice.

As shown in the example, data are represented in a Relational Context Family (RCF), which is a pair (K, R) such that K is a set of objects-attributes contexts Ki = (Oi, Ai, Ii), 1 ≤ i ≤ n, and R is a set of objects-objects contexts Rj = (Ok, Ol, Ij), 1 ≤ j ≤ m, with Ij ⊆ Ok × Ol. Given an objects-attributes context K = (O, A, I), we define rel(K) as the set of relations (objects-objects contexts) of R which have O as domain, and ρ as a function which associates a scaling operator with each objects-objects context of R. At each step, we extend the context K by adding relational attributes from each context of rel(K): we obtain the complete relational extension of K. When we compute the complete relational extension of each context of K, we obtain the complete relational extension of the RCF.

RCA generates a succession of contexts and lattices associated with the RCF (K, R) and ρ. At step 0, RCA generates the lattices associated with the contexts of K, with K^0 = K. At step e + 1, RCA computes the complete relational extension of the K^e contexts. The obtained extended contexts (K^(e+1)) possess relational attributes which refer to concepts of the lattices obtained at the previous step.

In our example, the RCF is composed of the contexts Wikisoftware and Programming Language and of the objects-objects context isWritten. When we compute the complete relational extension of this RCF, we extend the objects-attributes context of Wikisoftware with relational attributes which refer to concepts of Programming language. Figure 7 and Figure 8 represent the lattices of Wikisoftware and Programming language at step 0.

Fig. 8. Concept lattice of Wikisoftwares, step 0
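To make the relational extension step concrete, the sketch below applies existential scaling to the isWritten relation of Figure 6. The two target concepts and the simplified Wikisoftware context used here are a reduced, partly hypothetical excerpt chosen for illustration (the actual lattices and context of Figures 5, 7 and 8 are larger), and the attribute name exists isWritten:c is our own rendering of the q r:c notation.

```python
# Target concepts of the Programming languages lattice, given as (name, extent);
# a reduced, illustrative excerpt only.
language_concepts = [
    ("C_ObjectOriented", {"Java", "Php", "C#"}),
    ("C_Standardized:No", {"Perl", "Php"}),
]

# Objects-objects context isWritten (Figure 6).
is_written = {
    "TeamPage": {"Java"}, "Socialtext": {"Perl"}, "MindTouch": {"Php", "C#"},
    "DokuWiki": {"Php"}, "EditMe": {"Java"},
}

# Simplified (hypothetical) objects-attributes context of Wikisoftwares.
wiki_context = {
    "TeamPage": {"OS:No"}, "Socialtext": {"OS:Yes"}, "MindTouch": {"OS:No"},
    "DokuWiki": {"OS:Yes"}, "EditMe": {"OS:No"},
}

def relational_extension(context, relation, target_concepts):
    """Add one relational attribute 'exists isWritten:c' per target concept c to every
    source object linked to at least one object of c's extent (existential scaling)."""
    extended = {obj: set(attrs) for obj, attrs in context.items()}
    for name, extent in target_concepts:
        rel_attr = "exists isWritten:" + name
        for obj, targets in relation.items():
            if targets & extent:       # at least one linked object in the concept's extent
                extended[obj].add(rel_attr)
    return extended

step1_context = relational_extension(wiki_context, is_written, language_concepts)
# step1_context["DokuWiki"] contains "exists isWritten:C_ObjectOriented", because Php is object-oriented.
```

Rebuilding a lattice from such an extended context, and repeating until no new relational attribute appears, is exactly the step-by-step process described above.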
Figure 9 presents the concept lattice built from the extended objects-attributes context of Wikisoftwares at step 1. In this example, we cannot go further than step 1.

Fig. 9. Concept lattice of Wikisoftwares, step 1

In Figure 9, relational attributes are read as references to concepts of the lattice of Programming languages at step 0 (Figure 7). An extended concept lattice gives the same kind of information as a basic concept lattice, but it takes into account attributes from other product families. This brings a new dimension to the search and classification of products from the same family. In our example, it allows us to select a wikisoftware depending on the paradigms of its programming language. In Figure 9, we can read for instance:

– DokuWiki (Concept 1) is written in a programming language characterised by Concepts 6, 7, 8, 3, 5 and 0 of Programming languages, corresponding to the attributes Standardized:No, Procedural, Functional, Object-Oriented and Imperative;
– TeamPage and EditMe are equivalent products because they are introduced in the same concept (same attributes and same relational attributes);
– a wikisoftware that is not Open Source is written in a standardised language;
– an Open Source wikisoftware is written in an unstandardised language.

4 Assessing scalability of the approach

So far, we have proposed a first method to obtain a concept lattice from a PCM's description and a second method to depict, with interconnected lattices, the case where features possess their own PCM. In this section, we evaluate these two methods on existing PCMs from Wikipedia to get an order of magnitude of the size of the obtained structures. The number of concepts generated from a formal context depends on the number of objects, the number of attributes and the form of the context: this number can reach 2^min(|O|,|A|) for a lattice, while it is bounded by |O| + |A| for an AOC-poset. In the following tests, we generate both concept lattices and AOC-posets to emphasise the impact on the structure size of deleting the concepts which introduce neither attributes nor objects. Each test was made twice: a first time with full formal contexts (the scaling technique giving two attributes for each boolean feature, and keeping constraints in the form of attributes for constrained boolean features) and a second time with reduced formal contexts (the scaling technique giving one attribute for each boolean feature).

4.1 Scalability for non-connected PCMs (FCA)

Firstly, we analysed 40 PCMs from Wikipedia without taking into account relations between products. These 40 PCMs were converted into formal contexts with the method of Section 2. Results are presented in Figure 10. We analysed 1438 products and generated 4639 binary attributes and 26002 formal concepts. Most of the concept lattices possess between 50 and 300 concepts, but some of them reach about 5000 concepts: this number is very high, and it would be difficult to quickly process these data. Reduced contexts give smaller structures, but some of them remain considerable. In contrast, the results for AOC-posets are encouraging: the highest number of concepts obtained is 161, and most of them possess between 30 and 60 concepts.

               | Lattices (F) | Lattices (R) | AOC-posets (F) | AOC-posets (R)
Mean           | 427.375      | 131.35       | 50.3           | 41.025
Median         | 174          | 62.5         | 50             | 34
Minimum        | 21           | 8            | 16             | 8
Maximum        | 4551         | 1336         | 161            | 148
First Quartile | 49           | 28.25        | 29.25          | 20.5
Third Quartile | 282.5        | 127.75       | 61.25          | 53.75

Fig. 10. Number of concepts obtained with concept lattices and AOC-posets, from full formal contexts (F) and reduced formal contexts (R): summary statistics of the box plots
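The gap between lattice size and AOC-poset size visible in Figure 10 can also be illustrated programmatically on the small context of Figure 3: the sketch below naively enumerates all formal concepts and then keeps only the object-concepts and attribute-concepts that form the AOC-poset. The enumeration over all object subsets is only meant for tiny contexts, and the encoding is the same illustrative one used in the previous sketches.

```python
from itertools import combinations

# Formal context of Figure 3: product -> set of binary attributes.
context = {
    "Java": {"Standardized:Yes", "Functional", "Imperative", "Object-Oriented"},
    "Perl": {"Standardized:No", "Functional", "Procedural", "Imperative"},
    "Php":  {"Standardized:No", "Functional", "Procedural", "Imperative", "Object-Oriented"},
    "C#":   {"Standardized:Yes", "Functional", "Procedural", "Imperative", "Object-Oriented"},
}
objects = set(context)
attributes = set().union(*context.values())

def intent(objs):
    """Attributes shared by all objects of objs (all attributes if objs is empty)."""
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

def extent(attrs):
    """Objects possessing all the attributes of attrs."""
    return {o for o in objects if set(attrs) <= context[o]}

# All concepts: closures of every subset of objects (naive, exponential).
concepts = {(frozenset(extent(intent(set(s)))), frozenset(intent(extent(intent(set(s))))))
            for k in range(len(objects) + 1)
            for s in combinations(sorted(objects), k)}

# AOC-poset: only the concepts introducing at least an object or an attribute.
object_concepts = {(frozenset(extent(intent({o}))), frozenset(intent({o}))) for o in objects}
attribute_concepts = {(frozenset(extent({a})), frozenset(intent(extent({a})))) for a in attributes}

print(len(concepts), "concepts in the lattice")
print(len(object_concepts | attribute_concepts), "concepts in the AOC-poset")
```

On large contexts, dedicated algorithms and tools (such as the RCAExplore tool mentioned in Figure 7) are used instead of this naive enumeration.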
4.2 Scalability for connected PCMs (RCA)

Secondly, we made the same type of tests on structures which have been extended with a relation, using the method of Section 3. We wanted to run these tests on a fairly large number of relationships. However, while it is easy to find PCMs on Wikipedia, it is more difficult to automatically find relationships between these PCMs. To simulate relations between the PCMs used in the first test, we chose to automatically generate random objects-objects contexts based on two existing relations we found on Wikipedia. We analysed these relations and found that about 75% of the objects are linked to at least one other object and that 95% of the linked objects are linked to a single other object. We formed 20 pairs of objects-attributes contexts and generated an objects-objects context for each pair according to this analysis. Results are presented in Figure 11. Concept lattices are huge (some reach 14000 concepts) whereas AOC-posets remain relatively affordable (about 200 concepts).

               | Lattices (F) | Lattices (R) | AOC-posets (F) | AOC-posets (R)
Mean           | 2263.5       | 878.2        | 89.95          | 72.55
Median         | 793          | 194          | 82             | 61.5
Minimum        | 61           | 24           | 33             | 16
Maximum        | 14083        | 7478         | 201            | 189
First Quartile | 359.25       | 108.25       | 67.5           | 47.25
Third Quartile | 2719.5       | 597.75       | 101.25         | 87.25

Fig. 11. Number of concepts obtained with formal contexts extended with RCA, from full formal contexts (F) and reduced formal contexts (R): summary statistics of the box plots

These results are consistent with the real data that we used, in a very simplified form, for illustrating the approach. In the real data, the first existing relation is between LinuxDistribution (77 objects, 25 features) and FileSystem (103 objects, 60 features). With a lattice, we obtain 1174 concepts (full context) and 1054 concepts (reduced context); with an AOC-poset, we obtain 188 and then 179 concepts. The second relation is between Wikisoftware (43 objects, 12 features) and Programming Language (90 objects, 5 features). With a lattice, we obtain 282 concepts (full context) and 273 concepts (reduced context); with an AOC-poset, we obtain 87 and then 86 concepts. All the datasets (extracted from Wikipedia and generated) are available for reproducibility purposes at: http://www.lirmm.fr/recherche/equipes/marel/datasets/fca-and-pcm.

According to these results, we can deduce that the obtained concept lattices contain a majority of empty concepts (which introduce neither attributes nor objects): the AOC-poset appears to be a good alternative that generates less complex structures while keeping variability information in the same way as concept lattices. These results on product description datasets, which are either real datasets or datasets generated according to the profile of existing real ones, allow us to think that it is realistic to envision using FCA and RCA for variability representation within a canonical form integrating features and products.
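For completeness, the random objects-objects contexts used in these tests could be generated along the following lines, reflecting the observed profile (about 75% of objects linked to at least one target, about 95% of the linked objects linked to a single target). This sketch is our own reading of the protocol: the paper does not provide the authors' generator, and the function name and parameters are illustrative.

```python
import random

def random_relation(source_objects, target_objects, p_linked=0.75, p_single=0.95, seed=0):
    """Generate a random objects-objects context (source object -> set of target objects)
    roughly following the profile observed on the two real Wikipedia relations."""
    rng = random.Random(seed)
    relation = {}
    for obj in source_objects:
        if rng.random() >= p_linked:
            relation[obj] = set()                                   # object left unlinked
        elif rng.random() < p_single:
            relation[obj] = {rng.choice(target_objects)}            # linked to a single target
        else:
            # the few remaining linked objects point to two targets
            relation[obj] = set(rng.sample(target_objects, min(2, len(target_objects))))
    return relation

links = random_relation(["w%d" % i for i in range(43)], ["l%d" % i for i in range(90)])
```

The sizes used in the call above (43 source and 90 target objects) mirror the Wikisoftware and Programming Language contexts mentioned above, but any pair of objects-attributes contexts can be connected this way before running RCA.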
5 Related work

To our knowledge, the first research work using Formal Concept Analysis to analyse relationships between product configurations and variable features in order to assist the construction of a product line was realised by Loesch and Ploedereder [13]. In this approach, the concept lattice is analysed to extract information about groups of features that are always present, never present, always present together or never present together. The authors use this information to describe constraints between features and to propose restructurings of these features (merge, removal or identification of alternative feature groups). In [14], the authors study, in an aspect-oriented requirements engineering context, a concept lattice which classifies scenarios by functional requirements. They analyse on the one hand the relation between concepts and quality requirements (usability, maintenance), and on the other hand interferences between quality requirements; they also analyse the impact of modifications. This analysis has been extended with observations about the product information contained in the lattice (or the AOC-poset), and on the way some works implicitly use FCA without mentioning it [1]. In the present work, we show how to extend the original approach, which analyses products described by features, to a more general case where some features are products themselves. Moreover, we evaluate the scalability of FCA and RCA on product families described by PCMs from Wikipedia and linked by randomly generated relationships.

In the domain of product lines, another category of works focuses on the identification of features in source code using FCA [19, 3]. In this case, the described entities are variants of software systems described by their source code, and the authors try to produce groups of source code elements that are candidate features. Some works search for traceability links between features and the code [16]. In [7], the authors cross-reference source code parts with the scenarios that execute them and exercise features; the goal is to identify the parts of the code which correspond to feature implementations. In our case, we work neither on source code nor with scenarios, but with existing product descriptions.

Finally, a last category of works studies feature organisation in FMs with FCA. Some approaches [20, 15] use conceptual structures (concept lattices or AOC-posets) to recognise constraints, but also to suggest typed sub-feature relations linked to the domain. In [15], the authors study in detail the extraction of implication rules and cover relationships from the lattice, to determine for instance whether a set of features covers all the products. Recent works [2, 18] focus on emphasising logical relationships between features in an FM and on identifying transverse constraints. These logical relationships are more specifically used in [18] to analyse the variability of a set of architectural configurations. In the present work, we generate a structure, or several interconnected structures, created to analyse variability, but we do not consider issues about FM construction.

6 Conclusion

In this paper, we analyse the feasibility of using Formal Concept Analysis and concept lattices as a complement to Product Comparison Matrices to represent variability in product lines. We propose a method to convert a description from a product comparison matrix into a concept lattice using the scaling technique. We also propose a way to model, with Relational Concept Analysis, the special case where a product family is described by another product family.
We obtain interconnected lattices that bring a new dimension to the search and classification of products when they are in relation with other product families.

Subsequently, we run a series of tests to determine an order of magnitude of the number of concepts composing the structures obtained, firstly with FCA by converting PCMs into formal contexts, and secondly with RCA by introducing relations between these contexts. In these tests, we compare two structures: concept lattices, which establish a sub-order among all the concepts, and AOC-posets, which establish a sub-order among the concepts introducing at least an attribute or an object. It appears that most of the concepts do not introduce any information, and the AOC-poset is a more advantageous alternative, in particular for presenting information to an end-user. Concept lattices are useful too, when they are of medium size, and for automatic data manipulation. We also show the effect of two different encodings of boolean values.

In the future, we will use the work of [4] to automate the conversion of PCMs as much as possible. Moreover, research on the classification process (using different scaling operators) and its application to variability will be carried out to complete this analysis. A detailed study of the possibilities offered by RCA to model other cases is also considered. Finally, it could be interesting to study transitions between different structures such as FMs, PCMs, AOC-posets and concept lattices, in order to select one depending on the kind of operation we want to apply.

References

1. Al-Msie'deen, R., Seriai, A.D., Huchard, M., Urtado, C., Vauttier, S., Al-Khlifat, A.: Concept lattices: A representation space to structure software variability. In: ICICS 2014: The fifth International Conference on Information and Communication Systems. pp. 1–6. Irbid, Jordan (Apr 2014), http://hal-lirmm.ccsd.cnrs.fr/lirmm-01075533
2. Al-Msie'deen, R., Huchard, M., Seriai, A., Urtado, C., Vauttier, S.: Reverse engineering feature models from software configurations using formal concept analysis. In: Proceedings of the Eleventh International Conference on Concept Lattices and Their Applications, Košice, Slovakia, October 7-10, 2014. pp. 95–106 (2014)
3. Al-Msie'deen, R., Seriai, A., Huchard, M., Urtado, C., Vauttier, S., Salman, H.E.: Mining features from the object-oriented source code of a collection of software variants using formal concept analysis and latent semantic indexing. In: Proc. of the 25th SEKE. pp. 244–249 (2013)
4. Bécan, G., Sannier, N., Acher, M., Barais, O., Blouin, A., Baudry, B.: Automating the formalization of product comparison matrices. In: Proc. of the 29th ACM/IEEE ASE '14. pp. 433–444 (2014)
5. Clements, P.C., Northrop, L.M.: Software product lines: practices and patterns. Addison-Wesley (2001)
6. Czarnecki, K., Eisenecker, U.W.: Generative programming: methods, tools, and applications. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA (2000)
7. Eisenbarth, T., Koschke, R., Simon, D.: Locating features in source code. IEEE Trans. Softw. Eng. 29(3), 210–224 (2003)
8. Ganter, B., Wille, R.: Formal concept analysis - mathematical foundations. Springer (1999)
9. Griss, M.L., Favaro, J., Alessandro, M.d.: Integrating Feature Modeling with the RSEB. In: Proceedings of the 5th International Conference on Software Reuse. pp. 76–.
ICSR '98, IEEE Computer Society, Washington, DC, USA (1998), http://dl.acm.org/citation.cfm?id=551789.853486
10. Hacène, M.R., Huchard, M., Napoli, A., Valtchev, P.: Relational concept analysis: mining concept lattices from multi-relational data. Ann. Math. Artif. Intell. 67(1), 81–108 (2013)
11. Kang, K.C., Cohen, S.G., Hess, J.A., Novak, W.E., Peterson, A.S.: Feature-oriented domain analysis (FODA) feasibility study (November 1990)
12. Kang, K.C., Kim, S., Lee, J., Kim, K., Shin, E., Huh, M.: FORM: A feature-oriented reuse method with domain-specific reference architectures. Ann. Software Eng. 5, 143–168 (1998)
13. Loesch, F., Ploedereder, E.: Restructuring variability in software product lines using concept analysis of product configurations. In: Proc. of the 11th IEEE ECSMR. pp. 159–170 (2007)
14. Niu, N., Easterbrook, S.M.: Concept analysis for product line requirements. In: Proc. of the 8th AOSD 2009. pp. 137–148 (2009)
15. Ryssel, U., Ploennigs, J., Kabitzsch, K.: Extraction of feature models from formal contexts. In: Proc. of ACM SPLC '11. pp. 4:1–4:8 (2011)
16. Salman, H.E., Seriai, A., Dony, C.: Feature-to-code traceability in a collection of software variants: Combining formal concept analysis and information retrieval. In: Proc. of the 14th IEEE IRI. pp. 209–216 (2013)
17. Sannier, N., Acher, M., Baudry, B.: From comparison matrix to variability model: The Wikipedia case study. In: Proc. of the 28th IEEE/ACM ASE 2013. pp. 580–585 (2013)
18. Shatnawi, A., Seriai, A.D., Sahraoui, H.: Recovering architectural variability of a family of product variants. In: To appear in Proc. of the 14th ICSR (2015)
19. Xue, Y., Xing, Z., Jarzabek, S.: Feature location in a collection of product variants. In: Proc. of the 19th IEEE WCRE. pp. 145–154 (2012)
20. Yang, Y., Peng, X., Zhao, W.: Domain feature model recovery from multiple applications using data access semantics and formal concept analysis. In: Proc. of the 16th IEEE WCRE. pp. 215–224 (2009)