FCA&A 2015
Proceedings of the International
Workshop on Formal Concept Analysis and Applications
(co-located with ICFCA 2015)
Universidad de Málaga (Dept. Matemática Aplicada), Spain

ISBN 978-84-606-7410-8
The 13th Intl Conf on Formal Concept Analysis



Formal Concept Analysis
     and Applications

        FCA&A 2015

           Nerja (Málaga), Spain
            June 23–26, 2015




            (an ICFCA workshop)




                   Edited by

              Manuel Ojeda-Aciego
               Jaume Baixeries
               Christian Sacarea
Proceedings of FCA&A 2015 (co-located with ICFCA 2015)
© Manuel Ojeda-Aciego, Jaume Baixeries, Christian Sacarea, Editors




This work is subject to copyright. All rights reserved. Reproduction or publication
of this material, even partial, is allowed only with the authors' and editors'
permission.




Technical Editor:
Manuel Ojeda-Aciego, aciego@uma.es
                                Table of Contents


Preface

Depth-first Search for Pseudo-intents through Minimal Transversals . . . . . . .   1
   Alexandre Bazin

Edit Operations on Lattices for MDL-based Pattern Summarization . . . . . . . .   17
   Keisuke Otaki and Akihiro Yamamoto

An Approach Towards Classifying and Navigating RDF data based on
Pattern Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   33
   Mehwish Alam and Amedeo Napoli

Subdirect Decomposition of Contexts into Subdirectly Irreducible Factors . . . .   49
   Jean-François Viaud, Karell Bertet, Christophe Demko, and Rokia Missaoui

Context-Dependent Deductive and Inductive Reasoning Based on Good
Diagnostic Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   65
   Xenia Naidenova and Vladimir Parkhomenko

Concept Lattices of RDF Graphs . . . . . . . . . . . . . . . . . . . . . . . . .   81
   Jens Kötters

Variability representation in product lines using concept lattices: feasibility
study with descriptions from Wikipedia's product comparison matrices . . . . . .   93
   Jessie Carbonnel, Marianne Huchard, and Alain Gutierrez
                                  Preface


Formal Concept Analysis (FCA) is a mathematical field rooted in lattice and
order theory which, although being of such a theoretical nature, has proved to
be of interest to various applied fields such as Knowledge Discovery and Data
Mining, Database Theory, Data Visualization, and many others.
The 13th International Conference on Formal Concept Analysis (ICFCA) took
place from the 23rd to the 26th of June, 2015 in Nerja (Málaga), Spain, organized
by members of the Universidad de Málaga. The International Workshop on For-
mal Concept Analysis and Applications (FCA&A) was organized in conjunction
with ICFCA, and consisted of the seven works included in this volume.
Our deepest gratitude goes to all the authors of submitted papers, the Program
Committee members, and the external reviewers. Working with the efficient and
capable team of local organizers was a constant pleasure. We are deeply indebted
to all of them for making this conference a successful forum on FCA.
Last, but not least, we are most grateful to the organizations that sponsored
this event: the Universidad de Málaga, Andalucía Tech (International Campus
of Excellence), the Patronato de Turismo de la Costa del Sol and the Área
de Turismo del Ayuntamiento de Nerja, all in Spain. Finally, we would like to
acknowledge the great help of EasyChair in making our technical duties easier.




June 2015                                                  Manuel Ojeda-Aciego
                                                               Jaume Baixeries
                                                             Christian Sacarea
          Depth-first Search for Pseudo-intents through
                      Minimal Transversals

                                        Alexandre Bazin

                                  contact@alexandrebazin.com



             Abstract. The enumeration of pseudo-intents is a long-standing prob-
             lem in which the order plays a major role. In this paper, we present new
             algorithms that enumerate pseudo-intents in orders that do not necessar-
             ily respect the inclusion relation. We show that, confirming established
             results, this enumeration is equivalent to computing minimal transversals
             in hypergraphs a bounded number of times.


      1    Introduction

      Formal concept analysis, as a field of applied mathematics, is interested in the
      patterns that can be found in data taking the form of a set of objects described
      by attributes. Among these patterns are the implications, relations between at-
      tribute sets A and B representing the fact that every object described by A is
      also described by B. As often with this type of relation, interest revolves around
      a small, non-redundant set of implications. Such a set, the Duquenne-Guigues
      basis, is constructed using the notion of pseudo-intent. Indeed, computing the
      Duquenne-Guigues basis, and thus all the implications in a data set, is equiva-
      lent to enumerating all the pseudo-intents. In [15], it is shown that the number
      of pseudo-intents can be exponential in the number of attributes. It has been
      proven in [6, 2] that pseudo-intents cannot be enumerated with a polynomial
      delay in lectic or reverse lectic order. However, to the best of our knowledge,
      no such result exists for other orders. The best-known algorithm, Next Closure
      [11], enumerates only in the lectic order. In a previous work [4], we proposed an
      exponential delay algorithm to enumerate in any order that extends the inclu-
      sion order. In order to continue the study of the influence of the order on the
      complexity of this enumeration problem, we propose here two new algorithms
      that, together, can enumerate pseudo-intents without respecting the inclusion.
          Our algorithms enumerate pseudo-intents incrementally, i.e. given a formal
      context and a set of previously found implications, they return a new pseudo-
      intent. In order to do this, we define new conditions under which a pseudo-intent
      P can be recognized using the lower covers of P in the lattice of attribute sets
      closed under the implications already found. This allows us to build algorithms
      that, instead of starting from ∅ and adding attributes to construct sets, start
      from ⊤ (the set of all attributes) and remove attributes, resulting in a depth-first search in the lattice.
      We show that both constructing and recognizing a pseudo-intent can be done




    by computing a bounded number of times the minimal transversals of some
    hypergraph, which makes it easy to bound the delay and isolate special, easier
    cases. This result establishes a link with the proof in [6] that enumerating pseudo-
    intents without order is at least as hard as computing minimal transversals.
       We start by recalling in Section 2 the basic definitions of FCA and results
    on the enumeration of pseudo-intents and minimal transversals. In Section 3, we
    present the definition of a recognizable pseudo-intent that we use along with the
    properties needed for our algorithms. Finally, Sections 4 and 5 are dedicated to
    the two algorithms, an example and a brief study of their delay.


    2     Preliminaries
    2.1     Basics
    In formal concept analysis, data takes the form of a formal context. A formal
    context is a triple C = (O, A, R) in which O is a set of objects, A is a set of
     attributes and R ⊆ O × A is a relation between objects and attributes. An object
    o ∈ O is said to be described by an attribute a ∈ A if (o, a) ∈ R. Together with
     the context, there are two (·)′ operators defined as

                          (·)′ : 2^O → 2^A ,    O′ = {a ∈ A | ∀o ∈ O, (o, a) ∈ R}

                          (·)′ : 2^A → 2^O ,    A′ = {o ∈ O | ∀a ∈ A, (o, a) ∈ R}

        Their composition (·)′′ is a closure operator. A set of attributes A = A′′ closed
     under (·)′′ is called an intent.
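
To make these operators concrete, the following small Python sketch implements them on a toy context (the context, the function names and the set encoding below are our own illustrative choices, not part of the paper):

# A toy formal context C = (O, A, R) with R given as (object, attribute) pairs.
O = {1, 2, 3}
A = {"a", "b", "c"}
R = {(1, "a"), (1, "b"), (2, "b"), (2, "c"), (3, "a"), (3, "b"), (3, "c")}

def prime_of_objects(objs):
    """O' : the attributes shared by every object of objs."""
    return {a for a in A if all((o, a) in R for o in objs)}

def prime_of_attributes(attrs):
    """A' : the objects described by every attribute of attrs."""
    return {o for o in O if all((o, a) in R for a in attrs)}

def closure(attrs):
    """A'' : the composition of the two operators; A is an intent iff A = A''."""
    return prime_of_objects(prime_of_attributes(attrs))

print(sorted(closure({"a"})))       # ['a', 'b'] : {a} is not an intent
print(sorted(closure({"a", "b"})))  # ['a', 'b'] : {a, b} is an intent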


    2.2     Implications
    An implication is a relation between two attribute sets A and B, noted A → B.
     It is said to hold in the context if and only if A′ ⊆ B′ or, in other words, if and
    only if every object described by A is also described by B. A set I of implications
    is a basis if and only if every implication that holds in the context can be derived
     from I using Armstrong's rules: (i) if B ⊆ A then A → B; (ii) if A → B then
     A ∪ C → B ∪ C; and (iii) if A → B and B → C then A → C.

                               Fig. 1. A formal context

    A pseudo-intent (or pseudo-closed set) P is a set of attributes that contains
the closure of all its subsets and is not an intent. The set B = {P → P′′ | P
is a pseudo-intent} is the implication basis with the minimum cardinality and
is called the Duquenne-Guigues basis [12]. As such, computing the Duquenne-
Guigues basis, and thus all the implications, is enumerating all the pseudo-
intents. As one of the great problems in FCA, it has been abundantly studied
[12, 15, 6, 17, 4, 3, 16, 1, 19]. As the number of pseudo-intents can be exponential
in the size of the context, preventing a polynomial algorithm, the attention is
instead focused on the delay, i.e. the time between the output of two consecutive
pseudo-intents and between the last pseudo-intent and the end of the algorithm.
It has been shown that pseudo-intents cannot be enumerated with a polynomial
delay in lectic [6] or reverse lectic order [2]. For the enumeration without order, it
is shown in [6] that it is at least as hard as computing the minimal transversals
of a hypergraph. However, the upper bound is not yet known. The problem of
recognizing a pseudo-intent has been studied in [1] and found to be coNP-complete.

2.3     Logical Closures

A set of implications I gives rise to a logical closure operator I(·) defined as

                      A+ = A ∪ {C | B → C ∈ I and B ⊆ A}
                         I(A) = A++...+ (up to saturation)

      Similarly, we have the logical pseudo-closure operator I − (·) defined as

                       A◦ = A ∪ {C | B → C ∈ I and B ⊂ A}
                         I − (A) = A◦◦...◦ (up to saturation)

   It is known that intents are closed under both B − (·) and B(·), whereas a
pseudo-intent P is closed under B − (·) and I(·) for I ⊆ B \ {P → P′′}.
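
For illustration, the two saturation operators can be sketched as follows (a minimal Python sketch; the encoding of implications as pairs of frozensets and all names are our own):

# Logical closure I(.) and pseudo-closure I^-(.) by saturation.
# An implication A -> B is encoded as the pair (frozenset(A), frozenset(B)).

def logical_closure(attrs, implications):
    """I(A): repeatedly add conclusions whose premise is included in the set."""
    result = set(attrs)
    changed = True
    while changed:                       # iterate up to saturation
        changed = False
        for premise, conclusion in implications:
            if premise <= result and not conclusion <= result:
                result |= conclusion
                changed = True
    return frozenset(result)

def logical_pseudo_closure(attrs, implications):
    """I^-(A): the same, but a premise must be a *proper* subset of the set."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in implications:
            if premise < result and not conclusion <= result:
                result |= conclusion
                changed = True
    return frozenset(result)

I = [(frozenset("de"), frozenset("bde")), (frozenset("be"), frozenset("bde"))]
print(sorted(logical_closure("de", I)))         # ['b', 'd', 'e']
print(sorted(logical_pseudo_closure("de", I)))  # ['d', 'e'] : de is closed under I^-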
    An attribute set Y is said to be a generator of an attribute set X under
I(·) (resp. I − (·)) if I(Y ) = X (resp. I − (Y ) = X). It is a minimal generator
4    Alexandre Bazin




                         Fig. 2. ΦI with I = {de → bde, be → bde}


    if there is no generator Z of X such that Z ⊂ Y . We will use GenI (X) to
    denote the set of all minimal generators of a set X under I(·). The complexity
    of computing the minimal generators of a set has been studied in [14]. The
    number of minimal generators can be exponential in the number of attributes
    and they can be enumerated with an incremental polynomial delay.
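
The minimal generators of a set can be computed, for illustration, by the following brute-force Python sketch (exponential in |X|; the helper names are ours):

from itertools import combinations

def logical_closure(attrs, implications):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in implications:
            if premise <= result and not conclusion <= result:
                result |= conclusion
                changed = True
    return frozenset(result)

def minimal_generators(X, implications):
    """Gen_I(X): the minimal Y with I(Y) = X, enumerated smallest first."""
    X = frozenset(X)
    gens = []
    for size in range(len(X) + 1):
        for Y in map(frozenset, combinations(sorted(X), size)):
            if logical_closure(Y, implications) == X and \
               not any(G < Y for G in gens):        # keep only minimal ones
                gens.append(Y)
    return gens

I = [(frozenset("de"), frozenset("bde"))]
print([''.join(sorted(G)) for G in minimal_generators("bde", I)])  # ['de']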
        We shall use ΦI to denote the lattice of attribute sets closed under I(·)
    ordered by the inclusion relation. For an attribute set A, the set of pseudo-
     intents P in I such that P ⊆ A and P′′ ⊈ A will be denoted by PI (A).


    2.4   Minimal Transversals

     The problem of computing minimal transversals in hypergraphs or, equivalently,
     the prime conjunctive normal form of the dual of a Boolean function (otherwise
     called monotone dualization), has been extensively studied in the literature [13, 10,
     8, 9, 7].
         Given a hypergraph H = (E, V) where E is a set of edges and V ⊆ 2^E is a set
     of vertices, a transversal T ⊆ E is a set of edges such that ∀V ∈ V, T ∩ V ≠ ∅. A
     transversal is minimal if none of its proper subsets is a transversal. We shall use T r(V)
     to denote the set of minimal transversals of a set of vertices V. For example,
     T r({ace, bce, cde}) = {c, e, abd}.
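
As an illustration, minimal transversals can be computed naively as follows; the sketch reproduces the small example above (brute force, exponential; all names are ours):

from itertools import chain, combinations

def minimal_transversals(edges):
    """All inclusion-minimal sets meeting every member of edges."""
    ground = set().union(*edges)
    candidates = chain.from_iterable(
        combinations(sorted(ground), k) for k in range(1, len(ground) + 1))
    transversals = []
    for T in map(frozenset, candidates):             # smallest candidates first
        if all(T & E for E in edges) and not any(S < T for S in transversals):
            transversals.append(T)
    return transversals

edges = [frozenset("ace"), frozenset("bce"), frozenset("cde")]
print(sorted(''.join(sorted(T)) for T in minimal_transversals(edges)))
# ['abd', 'c', 'e'] : the three minimal transversals of the example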
       The problem of computing all the minimal transversals of a hypergraph can
    be solved in quasi-polynomial total time [9, 10, 13]. It is currently not known

whether it is possible to enumerate the transversals with a polynomial delay in
the general case. However, many special cases are known to have an output-
polynomial delay. Their references can be found in [9].
    In this work, we cannot presuppose the nature of the hypergraph and so we
are interested in the general case. The problem still being actively studied, we
will suppose that we have a blackbox algorithm nextTrans that enumerates
the minimal transversals of a set of vertices X. We suppose, for ease of writing,
that this algorithm enumerates the transversals Ti in some arbitrary order T1 <
T2 < ... < Tn , with nextTrans(Ti , X) = Ti+1 and nextTrans(Tn , X) = ∅.
For example, Berge Multiplication [5] can be adapted to take the role of
nextTrans. It is important to note that a first transversal can be computed
in polynomial time by simply removing attributes as long as the set remains a
transversal.
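
The polynomial-time computation of a first transversal mentioned above can be sketched as follows (a minimal Python sketch; the removal order is an arbitrary choice of ours):

def first_transversal(edges):
    """Greedily shrink the ground set while it remains a transversal."""
    T = set().union(*edges)                    # the ground set meets every edge
    for x in sorted(T):                        # try to drop each element once
        if all((T - {x}) & E for E in edges):
            T.discard(x)
    return frozenset(T)

edges = [frozenset("ace"), frozenset("bce"), frozenset("cde")]
print(''.join(sorted(first_transversal(edges))))   # 'e', one minimal transversal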

3   Recognizing a Pseudo-Intent
We know that an attribute set is a pseudo-intent if and only if it contains the
closure of all its subsets that are pseudo-intents. As for the problem of recognizing
pseudo-intents, two cases have been studied. When only a context and a set A
are provided, checking whether A is a pseudo-intent is coNP-complete [1]. An
algorithm has been proposed in [19] that runs in O(2^|A|) time. When the context and
A are provided alongside all the pseudo-intents (and thus implications) contained
in A, checking whether A is a pseudo-intent is easier as we only have to verify the
closedness of A for the logical closure and (·)′′, which implies a runtime polynomial
in the size of the implication set. It is this method that is used in algorithms
such as Next Closure or the one we proposed in [4] and its drawback is that it
forces an enumeration in an order that extends the inclusion order.
    In this paper, we are interested in enumerating pseudo-intents in orders that
do not extend the inclusion and, as such, we cannot suppose that we know the
closure of all the subsets of a set before attempting to recognize its pseudo-
closedness. Hence, we propose to consider the problem of recognizing a pseudo-
intent A given a context and a set of subsets of A that is not necessarily complete.

Proposition 1. An attribute set A ∈ ΦI is a pseudo-intent if A ≠ A′′ and all
its lower covers in ΦI are closed.

Proof. Let us suppose that A is not closed and that all its lower covers are
closed. Let B be a lower cover of A and X be a subset of A. If B ⊂ X, then
I(X) = A and X′′ = A′′. If X ⊆ B and X′′ ⊈ B, then B cannot be closed and
we have a contradiction. Therefore, X′′ ⊆ B. Since the closure of every subset
of A is either A′′ or contained in A, the set A is a pseudo-intent. □
   This proposition states that some pseudo-intents can be recognized in ΦI only
by looking at their lower covers. As such, when I is a subset of the Duquenne-
Guigues basis, a new pseudo-intent can be found by looking in ΦI for sets that
only have intents as lower covers.

     Definition 1. A pseudo-intent that can be recognized using Proposition 1 with
     a set of implications I is said to be recognizable. The set of all the pseudo-intents
     that are recognizable with I is denoted by Rec(I).
     Proposition 2. ∀I ⊂ B, Rec(I) ≠ ∅
    Proof. Let P be minimal among pseudo-intents that are not premises of impli-
    cations in I. If there is a lower cover X of P in ΦI that is not closed, then there
     is a set A ⊂ X in ΦI such that A → A′′ holds in the context. If A′′ ⊆ P , then
     P is not minimal. If A′′ ⊈ P , then P is not a pseudo-intent. Both cases lead to
    contradictions so all the lower covers of P are closed and P is recognizable from
    I. Since we have I ⊂ B, the set of pseudo-intents that are not a premise of I is
    not empty, so it has minimal elements. Hence, the set of pseudo-intents that are
     recognizable from I is not empty if I ⊂ B. □
        We have that, even though some unknown pseudo-intents may not be recog-
    nizable, there are always additional recognizable pseudo-intents for any subset
    of the Duquenne-Guigues basis. This ensures that we are able to enumerate all
    the pseudo-intents with an algorithm that finds recognizable ones.

    Proposition 3. Let A and B be two elements of ΦI . The set B is a lower cover
    of A if and only if B = A \ C with C a minimal transversal of GenI (A).
     Proof. Let C be minimal such that ∀G ∈ GenI (A), G ∩ C ≠ ∅. For any attribute
    i ∈ A \ C, the set (A \ C) ∪ {i} = A \ (C \ {i}) contains a minimal generator
    of A because of the minimality of C. Therefore, any B between A \ C and A is
    such that I(B) = A and cannot be in ΦI . Hence, A \ C is a lower cover of A.
        Let B be a lower cover of A in ΦI . By definition, I(B ∪ {i}) = A for any
    attribute i ∈ A \ B so there is a subset C of A such that i ∈ C and (C \ {i}) ⊆ B
     that is a minimal generator of A. □
        This proposition states that the lower covers of a set can be computed from
    and depend on its minimal generators. As such, the number of lower covers can be
    exponential in the number of minimal generators which can itself be exponential
    in the size of the attribute set [14].
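
Proposition 3 directly yields a brute-force way of listing lower covers: take the minimal transversals of GenI(A) and remove each of them from A. A small self-contained Python sketch, with our own helper names, on the implication set of Fig. 2:

from itertools import combinations

def closure(S, I):
    """Logical closure I(.) by saturation (as in Section 2.3)."""
    S = set(S)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in I:
            if premise <= S and not conclusion <= S:
                S |= conclusion
                changed = True
    return frozenset(S)

def subsets_smallest_first(X):
    for k in range(len(X) + 1):
        yield from map(frozenset, combinations(sorted(X), k))

def minimal_family(candidates, pred):
    """Inclusion-minimal sets satisfying pred, candidates given smallest first."""
    out = []
    for Y in candidates:
        if pred(Y) and not any(Z < Y for Z in out):
            out.append(Y)
    return out

def lower_covers(A, I):
    """Proposition 3: the lower covers of A in Phi_I are A - C, C in Tr(Gen_I(A))."""
    A = frozenset(A)
    gens = minimal_family(subsets_smallest_first(A), lambda Y: closure(Y, I) == A)
    trans = minimal_family(subsets_smallest_first(A), lambda T: all(T & G for G in gens))
    return [A - C for C in trans]

I = [(frozenset("de"), frozenset("bde")), (frozenset("be"), frozenset("bde"))]
print(sorted(''.join(sorted(B)) for B in lower_covers("bde", I)))
# ['bd', 'e'] : the two lower covers of bde in Phi_I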

    Definition 2. For any two attribute sets A and B, we say that B is reachable
    from A if and only if B ⊂ A and there is an ordering x1 < x2 < ... < xn of the
    elements of A \ B such that, for every m < n, A \ {x1 , x2 , ..., xm } is not closed
    under I.
        In other words, B is reachable from A if you can obtain B by removing at-
    tributes from A and only go through sets that are not closed under I. Evidently,
    minimal reachable sets are elements of ΦI .
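
Definition 2 can be checked by a depth-first search that deletes one attribute at a time and only continues through sets that are not closed under I. A brute-force sketch (our names; exponential, for illustration only):

def closure(S, I):
    S = set(S)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in I:
            if premise <= S and not conclusion <= S:
                S |= conclusion
                changed = True
    return frozenset(S)

def reachable_sets(A, I):
    """All sets reachable from A in the sense of Definition 2."""
    A = frozenset(A)
    found = set()
    def search(X):
        for x in X:
            Y = X - {x}
            if Y not in found:
                found.add(Y)              # Y is reachable through X
                if closure(Y, I) != Y:    # only non-closed sets may serve as
                    search(Y)             # intermediate steps (m < n)
    search(A)
    return found

I = [(frozenset("de"), frozenset("bde")), (frozenset("be"), frozenset("bde"))]
reach = reachable_sets("bcde", I)
minimal = [S for S in reach if not any(T < S for T in reach)]
print(sorted(''.join(sorted(S)) for S in minimal))
# ['b', 'd', 'e'] : the minimal reachable sets, all of them closed under I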

    Proposition 4. An attribute set A ∈ ΦI is a minimal recognizable pseudo-
     intent if and only if A is not an intent and all the minimal attribute sets reach-
    able from A are intents.

Algorithm 1 reachableSets(A)
Require: An attribute set A, an implication set I
 1: R = ∅
 2: for every i ∈ A do
 3:   G = ⋃_{P ∈ PI(A\{i})} GenI(P)
 4:   C = The first transversal of G
 5:   while C ≠ ∅ do
 6:      B = A \ ({i} ∪ C)
 7:      if PI(B) ≠ ∅ then
 8:         R = R ∪ reachableSets(B)
 9:      else
10:         R = R ∪ {B}
11:      end if
12:      C = nextTrans(C, G)
13:   end while
14: end for
15: Return R



Proof. If A is a minimal recognizable pseudo-intent, then every subset of A in
ΦI is an intent. Thus every minimal reachable set is an intent.
    Let us suppose that every minimal reachable subset of A is an intent and
A ≠ A′′. For any set X reachable from A and any attribute set P ⊆ X, if
P′′ ⊄ A, then X cannot be an intent. Every minimal reachable subset of A
being an intent, A contains the closure of all its subsets so it is a pseudo-intent.
If P is a pseudo-intent that is not in I, subsets of A \ {i} with i ∈ P′′ \ P cannot
be intents so either P or a superset of P in ΦI that is not an intent is reachable
from A. Hence, A is a minimal recognizable pseudo-intent. □
    A minimal recognizable pseudo-intent can therefore be recognized by com-
puting the sets of its minimal reachable subsets. These minimal reachable sets
can be computed using Algorithm 1. By definition, reachable sets can be com-
puted by removing from A attributes that play a role in the logical closure, i.e.
attributes in minimal generators of pseudo-intents P ⊂ A such that P′′ ⊈ A \ X
for some A \ X reachable from A. The minimal transversals T make up all the
possible ways to remove minimal generators of pseudo-intents and, thus, all the
sets T such that for every Y ⊂ T , A \ Y cannot be logically closed. As such,
removing minimal transversals of the generators of pseudo-intents used in the
logical closure of a set produces a reachable set. The set A \ T is not always
logically closed so the algorithm is recursively applied until an element of ΦI is
reached.

     The algorithm is not optimal as some reachable sets can be computed mul-
tiple times and the sets reachable from a set A that is not an element of ΦI
could be computed directly from PI (A) instead of its direct subsets. However,
it is sufficient to give us an upper bound of the runtime. For each recursive call,
the algorithm computes PI (A \ {i}) for every attribute i ∈ A. For each of these
     attributes, it then computes the minimal transversals of ⋃_{P ∈ PI(A\{i})} GenI(P).
    There is another recursive call for each pseudo-intent found missing so the run-
    time is bounded by O(|A| × |I| × T ) where T is the complexity of computing
    the transversals.

       In the rest of this paper, we will suppose that we have an algorithm
     nextReach that enumerates minimal reachable sets in the same fashion as
     nextTrans.


    4     Computing a Recognizable Pseudo-Intent

    4.1     Algorithm

    We want to incrementally compute the pseudo-intents in a formal context. In
    the beginning, we only have the formal context (O, A, R) and an empty set of
     implications I. The corresponding ΦI is the Boolean lattice of subsets of A. In
    order to find a first pseudo-intent, we can look for a recognizable set (Proposition
    1) in ΦI . We know that a non-closed set contains a pseudo-intent. Thus, we can
    start with A and recursively remove attributes as long as we obtain non-closed
    sets. When we cannot remove attributes without obtaining an intent, the set is
    a recognizable pseudo-intent and we have a first implication. This corresponds
    to the algorithm presented in [6] that computes a first pseudo-intent. We can
    then generalize the algorithm for other pseudo-intents. As I contains a new
    implication, sets disappear from ΦI and we now have to look in the new lattice
    for a second pseudo-intent. As it is not a Boolean lattice anymore, we cannot
    simply remove attributes to obtain lower covers so we have to use Proposition
    3 to compute them. By going through the lattice using non-closed lower covers
    of sets until we find a set respecting Proposition 1, we can compute a second
    pseudo-intent. Pseudo-intents can thus be computed incrementally until A no
    longer has any non-closed lower cover. Algorithm 2, given a context, a subset I
     of the Duquenne-Guigues basis and an attribute set A ∈ ΦI , uses this method
    to compute a new pseudo-intent P ⊆ A.
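
To illustrate the whole search, the following compact brute-force Python sketch mimics Algorithms 2 and 3 on a toy context of our own; the context, the helper names and the enumeration order of the transversals are our illustrative choices, not the paper's implementation:

from itertools import combinations

CONTEXT = {"o1": frozenset("abc"), "o2": frozenset("ad"), "o3": frozenset("bd")}
ATTRS = frozenset("abcd")

def intent_closure(X):
    """X'' in the context: attributes common to all objects containing X."""
    extent = [row for row in CONTEXT.values() if X <= row]
    return frozenset.intersection(*extent) if extent else ATTRS

def logical_closure(X, I):
    X = set(X)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in I:
            if premise <= X and not conclusion <= X:
                X |= conclusion
                changed = True
    return frozenset(X)

def subsets_smallest_first(X):
    for k in range(len(X) + 1):
        yield from map(frozenset, combinations(sorted(X), k))

def minimal_family(candidates, pred):
    out = []
    for Y in candidates:
        if pred(Y) and not any(Z < Y for Z in out):
            out.append(Y)
    return out

def next_pi(A, I):
    """Algorithm 2 (sketch): return A or a pseudo-intent P contained in A."""
    gens = minimal_family(subsets_smallest_first(A), lambda Y: logical_closure(Y, I) == A)
    trans = minimal_family(subsets_smallest_first(A), lambda T: all(T & G for G in gens))
    for C in trans:
        S = A - C                           # a lower cover of A in Phi_I (Prop. 3)
        if intent_closure(S) != S:          # not an intent: search below it
            return next_pi(S, I)
    return A                                # every lower cover is an intent

def all_pi():
    """Algorithm 3 (sketch): call nextPI until it returns the full attribute set."""
    I = []
    P = next_pi(ATTRS, I)
    while P != ATTRS:
        I.append((P, intent_closure(P)))    # record the implication P -> P''
        P = next_pi(ATTRS, I)
    return I

for premise, conclusion in all_pi():
    print("".join(sorted(premise)), "->", "".join(sorted(conclusion)))
# prints: c -> abc  and  ab -> abc, the Duquenne-Guigues basis of this toy
# context; as Section 4.2 shows, completeness is not guaranteed in general.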


     Proposition 5. nextPI(A, I) returns either A or a pseudo-intent that is not
    in I.

     Proof. Let X be a set in ΦI . If X′′ ≠ X, then there is a pseudo-intent P ⊆ X.
     If a lower cover S of X is not closed, then there is a pseudo-intent contained in S.
     The algorithm returns the first set it finds that has all its lower covers closed.
     If this set is not A, then it is not closed and is thus a pseudo-intent. □

        Algorithm 3 uses Algorithm 2 to enumerate pseudo-intents iteratively until
    it returns A. It is sound but not complete, as exemplified below.

Algorithm 2 nextPI(A, I)
Require: Attribute set A, Implication set I
 1: P = A
 2: G = GenI(A)
 3: C = The first transversal of G
 4: while C ≠ ∅ do
 5:   S = A \ C
 6:   if S′′ ≠ S then
 7:      P = nextPI(S, I)
 8:      C = ∅
 9:   else
10:      C = nextTrans(C, G)
11:   end if
12: end while
13: Return P

Algorithm 3 allPI
1: I = ∅
2: P = nextPI(A, I)
3: while P ≠ A do
4:   I = I ∪ {P → P′′}
5:   P = nextPI(A, I)
6: end while
7: Return I



4.2   Example

Let us consider an arbitrary context for which the Duquenne-Guigues basis is
{de → bde, bc → bcd, ac → abcd, be → bde, bcde → abcde}. We want to find its
pseudo-intents using Algorithm 3:
    The set of implications is initially empty. We start with abcde.
    It has a single minimal generator, itself. The first minimal transversal of
Gen(abcde) is a. The set abcde \ a = bcde is not an intent so we continue recur-
sively with it.
    The set bcde has a single minimal generator, itself. The first minimal transver-
sal of Gen(bcde) is b. The set bcde \ b = cde is not an intent so we continue
recursively with it.
    The set cde has a single minimal generator, itself. The first minimal transver-
sal of Gen(cde) is c. The set cde\c = de is not an intent so we continue recursively
with it.
    The set de has a single minimal generator, itself. The first minimal transversal
of Gen(de) is d. The set de \ d = e is an intent. The second and last minimal
transversal is e. The set de \ e = d is an intent. The algorithm returns de as the
first pseudo-intent which gives us I = {de → bde}.




     Fig. 3. The five steps of the algorithm. In red the sets considered by the recursive calls,
     in black the lower covers of those sets in ΦI .


        We start again with abcde. It has a single minimal generator, acde. The first
     minimal transversal of Gen(abcde) is a. The set abcde \ a = bcde is not an intent
     so we continue recursively with it.
         The set bcde has a single minimal generator, cde. The first minimal transver-
     sal of Gen(bcde) is c. The set bcde \ c = bde is an intent. The second minimal
     transversal is d. The set bcde \ d = bce is not an intent so we continue recursively
     with it.
         The set bce has a single minimal generator, itself. The first minimal transver-
     sal of Gen(bce) is b. The set bce \ b = ce is an intent. The second minimal
     transversal is c. The set bce \ c = be is not an intent so we continue recursively
     with it.
        The set be has a single minimal generator, itself. The first minimal transversal
     of Gen(be) is b. The set be \ b = e is an intent. The second and last minimal
     transversal is e. The set be \ e = b is an intent. The algorithm returns be as the
     second pseudo-intent which gives us I = {de → bde, be → bde}.
        We start again with abcde. It has two minimal generators, acde and abce.
     The first minimal transversal of Gen(abcde) is a. The set abcde \ a = bcde is not
     an intent so we continue recursively with it.

        The set bcde has two minimal generators, bce and cde. The first minimal
     transversal of Gen(bcde) is c. The set bcde \ c = bde is an intent. The second

minimal transversal is bd. The set bcde \ bd = ce is an intent. The third and
last minimal transversal is e. The set bcde \ e = bcd is an intent. The algorithm
returns bcde as the third pseudo-intent which gives us I = {de → bde, be →
bde, bcde → abcde}.
    We start again with abcde. It has two minimal generators, cde and bce. The
first minimal transversal of Gen(abcde) is bd. The set abcde \ bd = ace is not an
intent so we continue recursively with it.
    The set ace has a single minimal generator, itself. The first minimal transver-
sal of Gen(ace) is a. The set ace \ a = ce is an intent. The second minimal
transversal is c. The set ace \ c = ae is an intent. The third minimal transversal
is e. The set ace \ e = ac is not an intent so we continue recursively with it.
   The set ac has a single minimal generator, itself. The first minimal transversal
of Gen(ac) is a. The set ac \ a = c is an intent. The second and last minimal
transversal is c. The set ac \ c = a is an intent. The algorithm returns ac as the fourth
pseudo-intent which gives us I = {de → bde, be → bde, bcde → abcde, ac →
abcd}.
    We start again with abcde. It has three minimal generators, ace, cde and bce.
The first minimal transversal of Gen(abcde) is c. The set abcde \ c = abde is
an intent. The second minimal transversal is e. The set abcde \ e = abcd is an
intent. The third and last minimal transversal is abd. The set abcde \ abd = ce
is an intent. The algorithm ends.
    The algorithm has found four pseudo-intents but is unable to find the set bc
when considering the minimal transversals, and thus the sets, in that partic-
ular order. Indeed, knowing be → bde and de → bde makes bcde recognizable
and blocks every path from abcde to bc in ΦI . Finding bc before either de or be
solves the problem.


4.3   Complexity

The nature of Algorithm 2 makes it easy to bound the delay between finding
two pseudo-intents. As the algorithm starts from A and removes attributes until
it ends, it performs a maximum of |A| recursive calls before finding a pseudo-
intent. In a call, three different tasks are performed: computing the closure
of a set, computing the set of minimal generators and computing the set of all
lower covers. The closure of a set is known to be computable in polynomial time.
Computing GenI (A) is done in time polynomial in the size of the output. The
computation of all the lower covers of A, as we have seen in Section 2.4, can be
performed in quasi-polynomial total time, i.e. in time quasi-polynomial in the
size of n = |GenI(A)| and m = |T r(GenI(A))| (note that m itself is bounded by
the central binomial coefficient (n choose ⌊n/2⌋) [18]). Hence, the delay is in
O(|A| × (C + G + T )) with C
the complexity of computing a closure, G the complexity of computing minimal
generators and T the complexity of computing the lower covers.

     Algorithm 4 nextMinPI(A, I)
     Require: Attribute set A, Implication set I
      1: P = A
      2: C = The first minimal set reachable from A
      3: while C ≠ ∅ do
      4:   if C′′ ≠ C then
      5:      P = nextMinPI(C, I)
      6:      C = ∅
      7:   else
      8:      C = nextReach(C, A)
      9:   end if
     10: end while
     11: Return P



         An interesting point of detail is that the algorithm does not necessarily enu-
     merate all the intents. When all the upper covers of an intent A in ΦI are
     intents themselves, A cannot be reached by the algorithm anymore. In particu-
     lar, if A ∪ {i} = (A ∪ {i})′′ for every attribute i not in A, we are sure that A will never be
     considered. In the previous example (see Figure 3), we see that the intents abd,
     ab, ad, bd, de and ∅ are never considered because they do not appear as lower
     covers of sets. We suppose that this property helps reduce the runtime on dense
     contexts with a high number of intents.


     5     Computing a Minimal Recognizable Pseudo-Intent

     5.1    Algorithm

     The problem with Algorithm 3 is that there are some pseudo-intents it cannot
     compute. Among those leftover pseudo-intents, there is necessarily at least one
     that is minimal. We have shown in Proposition 4 that a minimal recognizable
     pseudo-intent P can be recognized through the minimal sets reachable from P .
     Those reachable sets can themselves be obtained by computing the transversals
     of the sets of minimal generators of the pseudo-intents involved in the logical
     closure of the different P \ {i} (Algorithm 1). Algorithm 4 uses these properties
     to compute a minimal recognizable pseudo-intent. In a manner similar to that of
     Algorithm 2, the algorithm starts with A and enumerates the minimal reachable
     sets. Once one that is not an intent is found, the algorithm is recursively applied.
     The algorithm ends when it finds a set with only intents as minimal reachable
     sets. The main difference between Algorithm 2 and Algorithm 4 is that the
     former performs a depth-first search in the lattice ΦI while the latter performs
     it in the directed graph in which the nodes are the elements of ΦI and the edges
     are the pairs in the “reachable” relation.

     Proposition 6. Algorithm 4 returns either A or a minimal pseudo-intent that
     is not in I.

Algorithm 5 allPI2
1: I = ∅
2: P = nextMinPI(A, I)
3: while P ≠ A do
4:   I = I ∪ {P → P′′}
5:   P = nextMinPI(A, I)
6: end while
7: Return I



Proof. From Proposition 4 we know that if every minimal set T ∈ ΦI reachable
from A ∈ ΦI is closed and A is not, then A is a minimal recognizable pseudo-
intent. If T ∈ ΦI is not closed, then T contains a minimal recognizable pseudo-
intent so the algorithm can be recursively called on T . The algorithm stops
when it encounters a set A with only intents as minimal reachable sets. The set
A being the only intent on which the algorithm is called, it returns either A or
a minimal recognizable pseudo-intent. □
    We can enumerate all the pseudo-intents with Algorithm 5 using Algorithm
4 in the same fashion as Algorithm 3.


Proposition 7. Algorithm 5 is sound and complete.

Proof. Soundness follows from Proposition 6. For any implication set I ⊂ B,
pseudo-intent P not used in I and attribute set A ∈ ΦI such that P ⊂ A, no
set X such that P ⊂ X and P′′ ⊈ X can be an intent so either P or one of its
non-closed supersets is reachable from A. Hence, the algorithm does not stop
until every pseudo-intent has been found. □


5.2   Example

Contrary to Algorithm 3, Algorithm 5 is complete. We can either use it to
compute all the pseudo-intents or combine it with Algorithm 3. Let us suppose
we have used Algorithm 3, as exemplified in Section 4, and that we know the
set of implications I = {de → bde, be → bde, bcde → abcde, ac → abcd}. Using
Algorithm 5 from there gives us the following :
    We start with abcde. The first minimal reachable set abcde\(a∪bd) = ce is an
intent. The second minimal reachable set abcde \ (a ∪ bc) = de is an intent. The
third minimal reachable set abcde\(a∪e) = bcd is an intent. The fourth minimal
reachable set abcde \ (b ∪ cd) = ae is an intent. The fifth minimal reachable set
abcde\(b∪e) = cd is an intent. The sixth minimal reachable set abcde\(b∪ce) =
ad is an intent. The seventh minimal reachable set abcde \ c = abde is an intent.
The eighth minimal reachable set abcde \ (d ∪ ab) = ce is an intent. The ninth
minimal reachable set abcde \ (d ∪ ae) = bc is not an intent so we continue
recursively with it.

        The set bc contains no pseudo-intents so its minimal reachable sets are bc\b =
     c and bc \ c = b. They are both intents so the algorithm returns bc as a new
     pseudo-intent which gives us I = {de → bde, be → bde, bcde → abcde, ac →
     abcd, bc → bcd}.
         We start again with abcde. The first eight minimal reachable sets computed
     are the same as in the previous step. The ninth reachable set abcde \ cde = ab
     is an intent. The tenth and last reachable set abcde \ e = abcd is an intent. The
     algorithm ends.
        In this example, the pseudo-intent bc has been found after bcde. Using both
     Algorithm 3 and Algorithm 5, it is thus possible to enumerate pseudo-intents in
     orders that do not extend the inclusion.

     5.3    Complexity
     Once again, the delay between two pseudo-intents with Algorithm 5 is easy
     to bound. Algorithm 4 performs at most |A| recursive calls before returning
     a pseudo-intent or A. Each call requires the computation of, at most, all the
     reachable sets in ΦI . We have shown that this can be done in O(|A| × |I| × T )
     with T the complexity of computing minimal transversals of minimal generators
     of pseudo-intents. Thus, the worst case delay of Algorithm 5 is |I| times that of
     Algorithm 3.
         A first minimal transversal can be computed in polynomial time. As such,
     for any attribute set A, we can compute its first reachable set in ΦI in time
     polynomial in |A| and |I|. Hence, if, for every A ∈ ΦI that is not a pseudo-intent,
     the reachable sets in ΦI are ordered in such a way that the first one is not closed,
     Algorithm 5 can compute (but not necessarily recognize) a pseudo-intent in time
     polynomial in the size of the implication set. A minimal pseudo-intent P , by
     definition, does not contain any other pseudo-intent so the set of its reachable
     sets is {P \ {p} | p ∈ P }, which can be computed in polynomial time. Thus,
     there are always orderings of reachable sets such that Algorithm 5 can compute
     minimal pseudo-intents with an incremental polynomial time between two of
     them. However, as shown in [6], minimal pseudo-intents cannot be enumerated
     in output polynomial time. Indeed, even though it is possible to compute a
     minimal pseudo-intent in time polynomial in the size of the implication set,
     confirming that they have all been found is exponential because non-minimal
     pseudo-intents and A have a potentially exponential number of reachable sets.


     6     Conclusion
     We have proposed two algorithms for computing pseudo-intents using the com-
     putation of minimal transversals as their main operation. The first can find
     pseudo-intents before some of their subsets that are themselves pseudo-intents
     but it is not complete for some enumeration orders. The second is complete but
     can only enumerate in orders that respect the inclusion relation. They can be

combined to obtain the enumeration order of the first with the completeness of
the second. We expressed their delay as a function of the complexity of com-
puting minimal transversals. We showed that, for some orderings of attribute
sets, pseudo-intents can be reached, but not recognized, in incremental polyno-
mial time. As both reaching and recognizing pseudo-intents, in our context, are
related to the problem of computing minimal transversals in hypergraphs for
which many special cases are known, we believe that this work may be useful in
isolating more cases for which the enumeration of pseudo-intents is easier.
    On the practical side, the fact that the algorithms do not necessarily enu-
merate the entirety of the set of intents should significantly reduce the runtime
on dense contexts. Furthermore, the depth-first strategy used in the algorithms
allows for some computations to be saved and reused. Further algorithmic opti-
mizations and empirical comparisons with other algorithms for the same problem
will be the subject of future investigations.


References
 1. Mikhail A. Babin and Sergei O. Kuznetsov. Recognizing Pseudo-intents is coNP-
    complete. In Proceedings of the 7th International Conference on Concept Lattices
    and Their Applications, Sevilla, Spain, October 19-21, 2010, pages 294–301, 2010.
 2. Mikhail A. Babin and Sergei O. Kuznetsov. Computing Premises of a Minimal
    Cover of Functional Dependencies is Intractable. Discrete Applied Mathematics,
    161(6):742–749, 2013.
 3. Konstantin Bazhanov and Sergei A. Obiedkov. Comparing Performance of Algo-
    rithms for Generating the Duquenne-Guigues Basis. In Proceedings of The Eighth
    International Conference on Concept Lattices and Their Applications, Nancy,
    France, October 17-20, 2011, pages 43–57, 2011.
 4. Alexandre Bazin and Jean-Gabriel Ganascia. Enumerating Pseudo-Intents in a
    Partial Order. In Proceedings of the Tenth International Conference on Concept
    Lattices and Their Applications, La Rochelle, France, October 15-18, 2013., pages
    45–56, 2013.
 5. Claude Berge. Hypergraphs: Combinatorics of Finite Sets, volume 45. Elsevier,
    1984.
 6. Felix Distel and Baris Sertkaya. On the Complexity of Enumerating Pseudo-intents.
    Discrete Applied Mathematics, 159(6):450–466, 2011.
 7. Thomas Eiter and Georg Gottlob. Identifying the Minimal Transversals of a Hy-
    pergraph and Related Problems. SIAM J. Comput., 24(6):1278–1304, 1995.
 8. Thomas Eiter, Georg Gottlob, and Kazuhisa Makino. New Results on Mono-
    tone Dualization and Generating Hypergraph Transversals. SIAM J. Comput.,
    32(2):514–537, 2003.
 9. Thomas Eiter, Kazuhisa Makino, and Georg Gottlob.             Computational As-
    pects of Monotone Dualization: A Brief Survey. Discrete Applied Mathematics,
    156(11):2035–2049, 2008.
10. Michael L. Fredman and Leonid Khachiyan. On the Complexity of Dualization of
    Monotone Disjunctive Normal Forms. J. Algorithms, 21(3):618–628, 1996.
11. Bernhard Ganter. Two Basic Algorithms in Concept Analysis. In Preprint Tech-
    nische Hochschule Darmstadt, 1984.

     12. Jean-Louis Guigues and Vincent Duquenne. Familles Minimales d’Implications
         Informatives Résultant d’un Tableau de Données Binaires. Mathématiques et Sci-
         ences humaines, 95:5–18, 1986.
     13. Vladimir Gurvich and Leonid Khachiyan. On Generating the Irredundant Con-
         junctive and Disjunctive Normal Forms of Monotone Boolean Functions. Discrete
         Applied Mathematics, 96-97:363–373, 1999.
     14. Miki Hermann and Baris Sertkaya. On the Complexity of Computing Generators
         of Closed Sets. In Formal Concept Analysis, 6th International Conference, ICFCA
         2008, Montreal, Canada, February 25-28, 2008, Proceedings, pages 158–168, 2008.
     15. Sergei O. Kuznetsov. On the Intractability of Computing the Duquenne-Guigues
         Base. J. UCS, 10(8):927–933, 2004.
     16. Sergei O. Kuznetsov and Sergei A. Obiedkov. Counting Pseudo-intents and #P-
         completeness. In Formal Concept Analysis, 4th International Conference, ICFCA
         2006, Dresden, Germany, February 13-17, 2006, Proceedings, pages 306–308, 2006.
     17. Sergei O. Kuznetsov and Sergei A. Obiedkov. Some Decision and Counting Prob-
         lems of the Duquenne-Guigues Basis of Implications. Discrete Applied Mathemat-
         ics, 156(11):1994–2003, 2008.
     18. Zbigniew Lonc and Miroslaw Truszczynski. On the Number of Minimal Transver-
         sals in 3-uniform Hypergraphs. Discrete Mathematics, 308(16):3668–3687, 2008.
     19. Sebastian Rudolph. Some Notes on Pseudo-closed Sets. In Formal Concept Analy-
         sis, 5th International Conference, ICFCA 2007, Clermont-Ferrand, France, Febru-
         ary 12-16, 2007, Proceedings, pages 151–165, 2007.
               Edit Operations on Lattices
          for MDL-based Pattern Summarization

                      Keisuke Otaki1,2 and Akihiro Yamamoto1
                  1
                   Department of Intelligence Science and Technology,
               Graduate School of Informatics, Kyoto University, Japan
            ootaki@iip.ist.i.kyoto-u.ac.jp, akihiro@i.kyoto-u.ac.jp
          2
            Research Fellow of the Japan Society for the Promotion of Science


          Abstract. The closedness, a fundamental concept in FCA, is also im-
          portant for addressing the pattern redundancy problem in pattern mining,
          where some enumerated patterns subsume others and are therefore redun-
          dant. Although many efficient mining algorithms have been proposed,
          finding characteristic and easy-to-interpret subsets of the whole set of enu-
          merated patterns, called the pattern summarization problem, is still chal-
          lenging. A well-used Information Theory-based criterion, the Minimum
          Description Length (MDL) principle, helps us to tackle the problem, but
          it requires us to design the MDL evaluation from scratch for each type
          of pattern. In this paper we propose a new framework applicable to
          various patterns using lattices, which are beneficial for formulating the
          MDL evaluation. A key idea is to revise an existing model by using edit
          operations defined among concepts on lattices, which enables us to take
          additional information such as background knowledge into account and
          helps us to design the MDL evaluation. We experiment with our method
          to show that it helps us to obtain informative results from the whole
          sets, and confirm that our model is applicable to various patterns.


  Keywords: formal concept analysis, lattice structures, edit operations, pattern
  summarization, minimum description length principle

  1    Introduction
  Knowledge Discovery using binary feature vectors is an important subject in
   data mining, where a binary value represents whether or not an object has
   some feature. A way to construct such vectors is to use pattern mining: For a
   datum d and an enumerated pattern p, we compute a binary value bd,p = 1 if
  and only if p occurs in d. Since the frequent itemset mining problem was first
  proposed by Agrawal and Srikant [2], various patterns and algorithms have been
  studied [1]. We can exploit obtained vectors as representations of data within
  several Machine Learning methods, and tasks such as clustering and classification
  can be tackled with the vectors. Therefore pattern mining is an essential tool.
     Although many efficient mining algorithms have been studied, the pattern
  redundancy problem is still inevitable, where some patterns subsume others, enu-
  merated patterns are not independent, and obtained vectors contain redundant

values. Considering the closedness, a central concept in FCA, is a fundamental
way to resolve the problem and an example of the closedness used in pattern
mining is closed itemsets [12], known as compact representations of itemsets.
    The redundancy problem, however, still remains after considering the closed-
ness only. To select a subset of the whole set of enumerated patterns, the Min-
imum Description Length (MDL) principle has attracted much attention [7, 8,
14–16, 18]. An important idea is to represent databases with codes in a loss-less
manner, and to evaluate its difficulty by the length of codes used. For example,
the Krimp algorithm [18] is a fundamental method for itemsets. Other methods
for sequences and large graphs can be seen in [16] and [8], respectively. For each
type of pattern, we need to design an evaluation method from scratch, and this makes
theoretical analyses difficult. It is also hard to integrate MDL-based methods
with knowledge resources, although various kinds of side information (e.g., corpora, on-
tologies, etc.) are helpful for selecting enumerated patterns.
    Developing methods that are applicable to various patterns in a unified framework
and that can take side information into account is, therefore, an important open
problem. In this paper we present preliminary results of our new lattice-based
framework, which builds on an existing MDL-based method and enhances its utility for
the further development of such methods that overcome the problems above. Our key ideas
are introducing edit operations on lattices (in Section 3.1) for the loss-less rep-
resentation (by Proposition 1, Lemmata 1 and 2), and evaluating its difficulty
by extending an existing formulation (in Section 3.2). To investigate the appli-
cability of our framework, we apply it to pattern concept lattices (in
Section 4). We evaluate them on common benchmark datasets (in Sec-
tion 5). Our contributions are summarized below.

    – A new lattice-based framework ; it is extended from an existing model (code-
      tables in Section 2.2) by using terminologies from lattices.
    – The equivalency for closed itemsets (by Proposition 1).
    – Pattern structure-based generalization; GenSet, a generalization of itemsets
      using pattern structures (in Section 4) as an example of other lattices.
    – Empirical evaluation using some benchmark datasets (in Section 5).


2      Preliminary

We outline Formal Concept Analysis (FCA) and Frequent Closed Itemset Mining
(FCIM) for our new lattice-based framework, summarized in Table 1. We explain
an idea of the MDL-based algorithm for itemsets, named the Krimp algorithm.


2.1     Basic Concepts

Formal Concept Analysis [5] Let K = (G, M, I) be a context, where G
and M are sets of objects and attributes, and I ⊆ G × M is a binary relation.
A Galois connection between G and M , denoted shortly by (·)′, is defined as
A′ = {m ∈ M | (g, m) ∈ I for all g ∈ A} for A ⊆ G and B′ = {g ∈ G | (g, m) ∈
I for all m ∈ B} for B ⊆ M , respectively. A pair (A, B) is a concept if A′ = B
and B′ = A, where A and B are called the extent and the intent, and we denote
them by A = Ext(c) and B = Int(c), respectively. For an object g ∈ G, the
concept ({g}′′, {g}′) is the object concept of g, denoted by γg.

               Table 1: The correspondence between FCA and FCIM.

                  FCA [5]                  FCIM with a parameter θ [12]
   objects        G; a set of objects      D; a set of transactions
   attributes     M ; attributes           I; items
   functions      {(·)′, (·)′}             {f, g}
   task           find all concepts        find all itemsets I ⊆ I satisfying freq(I) ≥ θ
    Computing all concepts is an important task in FCA, and we denote the
set of all concepts by B(K). For two concepts (A, B) and (C, D), the partial
order ≤ is introduced by A ⊆ C, and ⟨B(K), ≤⟩ becomes a complete lattice. We
represent it by B if it is understood from the context. For a concept c ∈ B, we
define the upset of c, denoted ↑ c, by {d ∈ B | c ≤ d}³. In addition we define the
direct upset ⇑ c by {d ∈ ↑c | there exists no e ∈ ↑c s.t. e ≺ d}.


Frequent Closed Itemset Mining [12] Let I be the set of items. A set
I ⊆ I of items is called an itemset, and it is also called a transaction. A database
D = {t1 , . . . , tN } is a set of transactions. The frequency freq(I) of an itemset I
is defined as freq(I) = |{t ∈ D | I ⊆ t}|. For a parameter θ, the frequent itemset
mining (FIM) problem is the problem to find all itemsets satisfying freq(I) ≥ θ.
We denote the set of all frequent itemsets by F.
    For closed itemsets, two functions f and g are defined as f (T ) = {i ∈ I | ∀t ∈
T, i ∈ t} for T ⊆ D and g(I) = {t ∈ D | ∀i ∈ I, i ∈ t} for I ⊆ I. An itemset I
is closed if and only if f (g(I)) = I. The frequent closed itemset mining (FCIM)
problem is to find all frequent and closed itemsets from D with respect to θ. It is
easy to see that the two functions f and g work in the same manner as (·)′ in FCA⁴.
We denote the set of all frequent closed itemsets by CF.
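
For illustration, f , g and the closedness test can be sketched as follows on a toy database of our own (a brute-force sketch over all itemsets, only meant for very small I):

from itertools import combinations

D = [frozenset({1, 2, 3}), frozenset({1, 2, 3}), frozenset({1, 2}), frozenset({3})]
ITEMS = frozenset().union(*D)
THETA = 2

def g(itemset):                 # transactions containing every item of itemset
    return [t for t in D if itemset <= t]

def f(transactions):            # items contained in every listed transaction
    return frozenset.intersection(*transactions) if transactions else ITEMS

def frequent_closed_itemsets():
    result = []
    for k in range(len(ITEMS) + 1):
        for I in map(frozenset, combinations(sorted(ITEMS), k)):
            if len(g(I)) >= THETA and f(g(I)) == I:   # frequent and closed
                result.append(I)
    return result

print([tuple(sorted(I)) for I in frequent_closed_itemsets()])
# [(), (3,), (1, 2), (1, 2, 3)] : 4 closed itemsets, against 8 frequent ones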
    The closed itemsets aim at reducing the number of all frequent itemsets
enumerated in the naïve FIM problem, and many efficient algorithms have been
studied [13, 17]. However the number of enumerated itemsets is still large in
applications. They are often redundant because some patterns subsume others,
which causes the pattern redundancy problem: The enumerated patterns are too
numerous and redundant to exploit in applications although the closedness is
considered. In applications, we often need only a small subset that characterizes
all the enumerated itemsets and is easy to interpret. We call such an extraction
problem pattern summarization (also known as pattern set mining [6, 15, 18]).
³ This set ↑ c is also called the principal filter in partially ordered sets.
⁴ We prepare ids of transactions and the set D′ = {(i1 , t1 ), . . . , (iN , tN )}. By con-
  structing I = {(ik , j) | 1 ≤ k ≤ N, j ∈ tk } from D′, the FCIM problem with θ = 1 is
  equivalent to computing all intents of concepts from K = ({i1 , . . . , iN }, I, I).

      (X, usage, code)
      1    3    1
      2    4    2                                             (X, usage, code)
                             Encoded database                                     Encoded database
      3    2     3                                            2345   1    2345
      4    1      4      1    2   5           6       7   9   2379   1    2379   179     25    6
      5    2     5       2    3       4       5               179    3   179      2345
      6    1      6
                         1    2   7       8       9            25    1     25
                                                                                 179     2     8
      7    4    7                                               2    1      2
      8    1             1    7   9                                              179
                  8                                             6    1      6
      9    4    9        2    3   7       9                     8    1      8     2379


 (a) The standard code-table ST (left) and (b) A code-table CT (left) and the cover
 the cover of D by ST (right).             of D by CT (right).

    Fig. 1: The cover of D = {125679, 2345, 12789, 179, 2379} (from Example 1).


2.2       Two-part MDL Principle and The Krimp Algorithm
Out of various criteria for the selection, the two-part MDL principle [7] is often
adopted for itemsets, where code-tables are well-studied models. Databases (or
contexts) are encoded and represented by (binary) codes using code-tables.

Definition 1 (Code-tables [18]). A code-table CT consists of two columns
for itemsets and codes. Formally CT = {(X1 , c1 ), (X2 , c2 ), . . . } is a set of pairs,
where Xi ⊆ I, ci ∈ C, and C be the set of (binary) codes.

    For the convenience of our discussion, in the following, we regard code-tables
as sets of itemsets: For example X ∈ CT means (X, cX ) ∈ CT and cX is denoted
by code CT (X). We define CT − = {(X, cX ) ∈ CT | |X| > 1}. The standard code-
table ST consists of all singleton itemsets over I. Let us first introduce examples,
where sets are written as sequences (e.g. {1, 7, 9} is represented as 179).

Example 1. Let I = {1, . . . , 9} and D = {125679, 2345, 12789, 179, 2379}. In
Fig. 1a and 1b, the left table is a code-table and the right one is a database
represented by codes, where rectangles represent codes (e.g., the rectangle 7
means the code of the itemset 7). In Fig. 1a, the transaction 12789 is encoded
separately. In Fig. 1b (CT = ST ∪{25, 179, 2345, 2379}), it is done by 179∪2∪8.

   The process of representing databases using code-tables is called the cover.
Both frequent itemsets (e.g., 179) and non-frequent ones (e.g., 2345 or 2379) are
used and the MDL principle takes a balance among them. The property of the
cover of transactions is described by cover functions.

Definition 2 (Cover function5 ). Let CT be the set of all possible code-tables
over I, D be a database, t ∈ D be a transaction, and CT ∈ CT be a code-table.
We call the set of itemsets {X | (X, cX ) ∈ CT } the coding set of CT , denoted
by CS . A cover function on CT , cover : CT × 2^I → 2^CS , should satisfy:
C1. if X, Y ∈ cover (CT , t) then it holds either X = Y or X ∩ Y = ∅, and
C2. the union of all X ∈ cover (CT , t) equals t, i.e., t = ⋃_{X ∈ cover(CT ,t)} X.
⁵ This definition is modified from Definition 2 of [18] by using a coding set CS.

    Code-tables in Fig. 1 contain the additional information usage. Using coding
methods from Kolmogorov Complexity [9], codes are decided according to how
often itemsets are needed: For X ∈ CT , usage(X) is defined as |{t ∈ D | X ∈
cover (CT , t)}|, which determines the least length of prefix codes required.
Example 2 (from Example 1). In Fig. 1a, cover (ST , 179) = {1, 7, 9}, and in
Fig. 1b cover (CT , 12789) = {179, 2, 8}. It holds that usage(179) = 3.
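
One possible cover function satisfying C1 and C2 can be sketched as follows on the database of Example 1; the greedy largest-first order below is our own assumption, since Definition 2 only fixes the two conditions:

D = [frozenset(t) for t in ("125679", "2345", "12789", "179", "2379")]
ST = [frozenset(i) for i in "123456789"]                 # singleton itemsets
CT = [frozenset(x) for x in ("2345", "2379", "179", "25")] + ST

def cover(ct, t):
    """Greedily pick the largest code-table itemset still fitting the rest of t."""
    remaining, used = set(t), []
    for X in sorted(ct, key=len, reverse=True):
        if X <= remaining:                 # C1: the chosen itemsets are disjoint
            used.append(X)
            remaining -= X
    assert not remaining                   # C2: their union equals t
    return used

def usage(ct, X):
    return sum(X in cover(ct, t) for t in D)

print([''.join(sorted(X)) for X in cover(CT, frozenset("12789"))])  # ['179', '2', '8']
print(usage(CT, frozenset("179")))                                  # 3, as in Example 2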
If a transaction t ∈ D is encoded by a cover function cover (·, ·) and some CT ,
we say that t is covered. Then the MDL principle can be evaluated.
Definition 3 (Lemma 1, Definitions 4 and 5 in [18]). The two-part MDL
principle says that a code-table CT ∈ CT is the best for D if it minimizes
L(D, CT ) = L(CT ) + L(D | CT ), where L(CT ) is the length of the code-table
CT , and L(D | CT ) is that of D when CT is given. Following Definitions 1 and
2, terms L(CT ) and L(D | CT ) for L(D, CT ) are evaluated as follows:
    L(CT) = Σ_{X ∈ CT, usage(X) ≠ 0} [ L(code_ST(X)) + |code_CT(X)| ],        (1)

    L(D | CT) = Σ_{t ∈ D} Σ_{X ∈ cover(CT, t)} |code_CT(X)|,                   (2)

where L(code_ST(X)) = Σ_{i ∈ X} |code_ST({i})|.
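For concreteness, Equations (1) and (2) can be evaluated as in the sketch below, assuming Shannon-optimal prefix codes whose lengths come from usage counts, i.e. |code_CT(X)| = -log2(usage(X) / Σ_Y usage(Y)); this concrete length formula is the usual choice in the Krimp setting and is our assumption here. The sketch reuses a cover function such as the one given after Definition 2.

    import math

    def code_lengths(code_table, database, cover):
        """Usage counts and Shannon-optimal code lengths derived from them."""
        usage = {X: 0 for X in code_table}
        for t in database:
            for X in cover(code_table, t):
                usage[X] += 1
        total = sum(usage.values())
        return usage, {X: -math.log2(u / total) for X, u in usage.items() if u > 0}

    def two_part_length(code_table, standard_table, database, cover):
        """L(D, CT) = L(CT) + L(D | CT), following Equations (1) and (2)."""
        usage, len_CT = code_lengths(code_table, database, cover)
        _, len_ST = code_lengths(standard_table, database, cover)
        L_CT = sum(sum(len_ST[frozenset({i})] for i in X) + len_CT[X]
                   for X in code_table if usage[X] > 0)
        L_D_given_CT = sum(len_CT[X] for t in database for X in cover(code_table, t))
        return L_CT + L_D_given_CT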

   The Krimp algorithm [18] is an iterative algorithm that greedily updates a code-table CT ∈ CT by adding new frequent itemsets X ∈ F while evaluating L(D, CT). Please refer to Algorithm 3 in [18], which we follow in our experiments.
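A hedged sketch of this greedy loop is given below; it reuses two_part_length from the previous sketch and omits the candidate ordering and pruning details of Algorithm 3 in [18].

    def krimp_like(candidates, standard_table, database, cover):
        """Greedily grow a code-table, accepting a candidate only if it lowers L(D, CT)."""
        CT = list(standard_table)            # start from the standard code-table ST
        best = two_part_length(CT, standard_table, database, cover)
        for X in candidates:                 # frequent itemsets F, in some fixed order
            score = two_part_length(CT + [X], standard_table, database, cover)
            if score < best:                 # keep X only when it compresses better
                CT, best = CT + [X], score
        return CT, best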

2.3   Problems
Models based on code-tables have been widely used in MDL-based studies. The cover function adopts the set union to represent transactions, and the subset operation to divide them, as seen in Example 1. The process assumes that all items are at the same granularity, which is not the case in many applications. It also prevents us from taking side information (i.e., background knowledge and resources) about databases into account. The problems we target are listed below.
Problem 1. The cover function is difficult to generalize because it implicitly uses set operations.
Problem 2. Code-tables cannot reflect side information about items (e.g., hierarchies or degrees of importance of items) because of the implicitly used set operations.
   That is, we need to revise the set operations used in the existing method. Our
key idea is to adopt terminologies of lattices to overcome these problems. In
Section 3, we revise the cover function with lattices to build our new framework.
We apply it to pattern concept lattices by pattern structures in Section 4 to
confirm that our lattice-based framework is applicable to various patterns.


3      Models by Edit Operations for Closed Itemsets

We provide our framework based on lattices. For discussions, we define the object
concept corresponding to t ∈ D by γt = (g(t), t) from the relation in Table 1.
Here we only consider closed itemsets.


3.1     Cover Using Edit Operations on Lattices

In the cover of t ∈ D, t is recursively replaced with t \ B1 , t \ (B1 ∪ B2 ), . . .
by choosing mutually disjoint closed itemsets B1, B2, . . . until t becomes ∅ in some
order6 . We now have the following.

Observation 1 The cover of t ∈ D can be represented as a sequence of the
deletion of items: t 7→ t \ B1 , t \ B1 7→ t \ (B1 ∪ B2 ), . . . , t \ (B1 ∪ · · · ∪ Bm ) 7→ ∅.

Thus a transaction t ∈ D can also be covered by using deletions of items. We use such deletions to construct a new cover based on lattices. On lattices, selections of Bi correspond to selecting concepts from B, and their intents can be stored in code-tables. With respect to the search space of the selection, the following holds by properties of lattices.

Observation 2 For a transaction t ∈ D, only a sub-lattice corresponding to the
upset ↑ γt of the corresponding object concept γt of t should be considered.

    Choosing concepts, however, is not complete with respect to the loss-less property (i.e., C2 of Definition 2) because in general B does not contain all singleton itemsets as intents of concepts. On the other hand, by the discussion of closed itemsets [12], all frequent itemsets can be recovered from closed itemsets, so lattices should contain all the information required.
    Our idea is to extract such hidden information on lattices as edit operations on lattices (e.g., the deletion of items in Observation 1), and to represent t ∈ D by both the selections and the edit operations. We first introduce edit information.

Definition 4 (Edit Information). For c = (A, B) ∈ B, the edit information, denoted by Edit(A, B), is defined as Edit(A, B) ≡ B \ ⋃_{(C,D) ∈ ⇑(A,B)} D. For a set C of concepts, it is extended by Edit(C) = ⋃_{(A,B) ∈ C} Edit(A, B).

For c ∈ B, Edit(c) indicates the items which cannot be represented by intents of concepts other than c itself, even when all concepts in ↑c \ {c} are chosen.

Example 3 (from Example 2). For t3 = 12789, it holds that γ3 = (3, 12789) and Edit(γ3) = {8}. The item 8 cannot be represented by any concept except γ3. Therefore, if γ3 is not selected, the item 8 must be represented in another way.
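A minimal sketch of Definition 4, assuming the strict upset ⇑c is available as a list of intents; the data layout is ours and the intents below are the ones readable from Fig. 2.

    def edit_information(intent, upper_intents):
        """Edit(A, B) = B minus the union of the intents of the concepts above (A, B)."""
        return set(intent) - set().union(*upper_intents)

    # Example 3: the strict upset of γ3 = (3, 12789) has intents 1279, 179, 279, 79, 2
    # and {}, none of which contains 8, hence Edit(γ3) = {8}.
    above_gamma3 = [set("1279"), set("179"), set("279"), set("79"), set("2"), set()]
    print(edit_information(set("12789"), above_gamma3))   # -> {'8'}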
6 The standard cover order is a basic one from the Krimp algorithm: for an itemset B ⊆ I, the standard cover order is by cardinality |B| descending, then support freq(B) descending, and then lexicographically ascending [18].


Fig. 2: Examples of the cover with edit operations for γ3 = (3, 12789) (i.e., t3): (a) for t3 = 12789 with ST; (b) for t3 = 12789 with CT. [Figure: the concept lattice annotated with edit information, e.g., Edit(1345, 79) = {7, 9}, Edit(1235, 2) = {2}, Edit(134, 179) = {1}, and Edit(γ3) = {8}.]


    As another way, edit operations among intents of concepts are introduced to represent such information. Because B contains the top ⊤ = (G, ∅)7 and Int(γt) = t, all information for covering t should be found in ↑γt. We use
edit information to capture such information and to explain edit operations on
lattices (i.e., on intents of concepts). We encode edit information by tuples called
edit scripts, and store them in edit-tables by analogy with code-tables.
Definition 5 (Edit scripts and Edit-tables). An edit-table ET consists of
three columns: kinds of operations, arguments of operations (itemsets), and
codes. That is, ET = {(t1 , X1 , c1 ), . . . , (tm , Xm , cm )}, where ti indicates the
type of operation, and both Xi and ci are the same as in code-tables. The tuple
(t, X, cX ) is called an edit script. Let E be the set of all edit scripts.
Example 4 (from Example 3). Because Edit(γ3) = {8}, the item 8 should be
represented by the deletion of the item 8. Let DEL represent the deletion 8 . A
script (DEL, 8, c8 ) with a code c8 means the deletion of the item 8.
   Definition 5 is general and abstract. For closed itemsets, however, we only
need a specific kind of them. We define a function Encode : 2I → 2E for X ⊆ I
by Encode(X) = {(DEL, i, ci ) | i ∈ X, ci ∈ C}, where C is the set of codes.
Lemma 1. It holds that Edit(↑ γt) = t for any t ∈ D.
Proof If ↑γt = {γt, ⊤}, it holds that Edit(↑γt) = Edit(γt) = t. Then Encode(·) generates edit scripts for all items in t as deletions of an item. By induction on lattices (by ⇑c), we only need to confirm that for c ∈ B all items which cannot be covered by intents of concepts in ⇑c \ {c} can be represented by edit scripts. This is done by the definition of Edit(c) and the properties of ⟨B(K), ≤⟩. □
    Lemma 1 guarantees that any t ⊆ I can be represented by using edit scripts
when no concepts are chosen. Now the cover using lattices can be described using
a function cover L (·, ·) by analogy with cover (·, ·). The following describes how
to implement it instead of giving a new formal description of the function.
7 If B ≠ ∅ for the top ⊤ = (G, B) of ⟨B(K), ≤⟩, we can insert a dummy concept (G, ∅).
8 The type is required to introduce several types of edit operations if needed.

Definition 6 (Revised Cover). The function cover_L : CT × 2^I → 2^{2^I} × 2^E receives a code-table CT containing intents of selected concepts and a transaction t, and it returns a set of itemsets used (e.g., 179) and that of edit scripts.
    For a code-table CT (See Fig. 2b), we first choose intents of concepts from
CT in some order, and denote by C the set of selected concepts, which should
be excluded from the search space ↑ γt. The remainder should be represented by
edit scripts. Then cover L (CT , t) = ({Int(c) | c ∈ C}, Encode(Edit(↑ γt\ ↑ C))).

Example 5 (from Example 3). It holds that cover L (CT , t3 ) = ({179}, {(DEL, 2, c2 ),
(DEL, 8, c8 )}), representing the cover of t3 in Fig. 2b. It holds that cover L (ST , t3 ) =
{∅, {(DEL, 1, c1 ), (DEL, 2, c2 ), (DEL, 7, c7 ), (DEL, 8, c8 ), (DEL, 9, c9 )}}, which sim-
ulates the cover of t3 by singleton itemsets (in Fig. 2a).
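The following sketch mirrors Definition 6, assuming ↑γt is given as a collection of intents and that a concept is "selected" when its intent appears in CT; with the upset of Example 3 and Fig. 2 it reproduces the result of Example 5. Names and data structures are ours, and concrete codes are omitted.

    def encode(items):
        """Encode(X): one DEL edit script per item (codes omitted in this sketch)."""
        return {("DEL", i) for i in items}

    def cover_L(ct_intents, upset, strictly_above):
        """cover_L(CT, t): `upset` holds the intents (frozensets) of ↑γt,
        `ct_intents` the intents stored in CT, and `strictly_above[c]`
        the intents of the concepts strictly above c (⇑c)."""
        selected = [c for c in upset if c in ct_intents]
        # ↑C: concepts lying above (i.e. having a smaller intent than) a selected one
        above_selected = {d for d in upset for c in selected if d <= c}
        remaining = [c for c in upset if c not in above_selected]
        edited = set()
        for c in remaining:
            edited |= set(c) - set().union(*strictly_above[c])   # Edit(c)
        return [set(c) for c in selected], encode(edited)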

   If the set C in Definition 6 is the empty set ∅, the function cover L (·, ·) achieves
the same result as the function cover (·, ·) with ST , supported by Lemma 1 (See
Example 5). If some concepts are selected, the following lemma is important.

Lemma 2. Let C be the set in Definition 6 and (X, Y) = cover_L(CT, t). The set Y is complete to represent t \ ⋃_{X∈X} X.

Proof From Definition 6, if a concept c = (A, B) in ↑γt is selected and stored in CT (i.e., c ∈ C), any item i ∈ B is not represented by edit scripts in Y because there exists no concept d ∈ ↑γt \ ↑C satisfying i ∈ Edit(d). It holds that {i | i ∈ Edit(↑γt \ ↑C)} = t \ ⋃_{X∈X} X, and the function Encode(·) ensures the lemma, i.e., the cover by cover_L(·, ·) is valid when some concepts are chosen. □

   Together with the completeness shown in Lemmas 1 and 2, our edit scripts are enough to represent any t ∈ D while selecting concepts in C.


3.2   Model Evaluation Using Edit Operations

For our model (CT , ET ), with reference to Section 2.2, the MDL principle
L(D, CT , ET ) = L(CT , ET )+L(D | CT , ET ) can be evaluated. For L(CT , ET ),
    L(CT, ET) = Σ_{(X,cX) ∈ CT, usage(X) ≠ 0} [ L(code_ST(X)) + |cX| ]
              + Σ_{(t,X,cX) ∈ ET, usage(X) ≠ 0} [ L(code_OP(t)) + L(code_ST(X)) + |cX| ],   (3)


where code_OP(t) represents the codes used for distinguishing the types of edit scripts. For itemsets, L(code_OP(t)) = 0. The first term of Equation 3 is the same as in the Krimp algorithm, and the second term encodes the edit-table ET.
   The difficulty of representing databases, L(D | CT , ET ), is evaluated by
naturally generalizing the corresponding term of the Krimp algorithm below.

Fig. 3: A code-table T by the Krimp algorithm and our two tables (CT, ET): (a) code-table T (= T⁻ ∪ ST); (b) our model, two tables (CT, ET). Our table CT is equivalent to T⁻ and ET simulates ST. [Figure: T contains 2345, 2379, 179, 25, 2, 6, 8 with their usages and codes; CT contains 2345, 2379, 179, 25 and ET contains the scripts DEL 2, DEL 6, DEL 8.]



    L(D | CT, ET) = Σ_{t ∈ D} ( Σ_{(X,cX) ∈ X} |cX| + Σ_{(t,X,cX) ∈ Y} |cX| ),
                    for (X, Y) = cover_L(CT, t).                               (4)

We illustrate an example in Fig. 3 to show how edit-tables intuitively work.
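For illustration, Equations (3) and (4) can be evaluated as in the sketch below. The dictionaries mapping entries to (usage, code length) pairs and the fixed per-script operation cost op_len are our modelling assumptions, not the exact encoding used by the authors.

    def L_two_tables(ct, et, covers, length_ST, op_len=1.0):
        """L(D, CT, ET) = L(CT, ET) + L(D | CT, ET), cf. Equations (3) and (4).
        ct     : dict  intent (frozenset)   -> (usage, code_length)
        et     : dict  script ("DEL", item) -> (usage, code_length)
        covers : [cover_L(CT, t) for t in D], pairs (intents used, scripts used)
        length_ST : dict  item -> |code_ST({item})|."""
        L_model = sum(sum(length_ST[i] for i in X) + l
                      for X, (u, l) in ct.items() if u > 0)
        L_model += sum(op_len + length_ST[i] + l
                       for (_, i), (u, l) in et.items() if u > 0)
        L_data = sum(ct[X][1] for intents, _ in covers for X in intents) \
               + sum(et[s][1] for _, scripts in covers for s in scripts)
        return L_model + L_data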

3.3   The Equivalency of Models for Closed Itemsets
For closed itemsets, we now have the following proposition.
Proposition 1 Let T be the code-table obtained by the Krimp algorithm for the database D with some order on CF. With the same order, our model can achieve a pair (CT, ET) satisfying L(D, T) = L(D, CT, ET) for closed itemsets.

Proof (Sketch) At the beginning, it holds that T = ST (i.e., T − = ∅) and
CT = ∅, where all transactions should be covered by singleton itemsets. From
the definition of Encode(·) and Lemma 1, the proposition immediately holds.
If a closed itemset B ∈ CF is selected and stored as B ∈ T , there exists a
concept (A, B) ∈ B and its intent B is stored in our table CT . Because we
use the same order in the cover of t ∈ D, for (X , Y) = cover L (CT , t), it holds
that X = cover (T, t), meaning that the usage in T − and CT must be identical.
The usage of edit scripts must be the same as well by the fact X = cover (T, t),
Lemmata 1, 2, and the definition of Encode(·). □

Summary Our idea is to revise the cover in the Krimp algorithm with lat-
tices. Our framework divides code-tables into a part of intents of concepts (i.e.,
closed itemsets) and that of edit scripts simulating singleton itemsets. For closed
itemsets, our model can achieve the same result as that obtained by the Krimp
algorithm (Proposition 1), which is supported by properties of lattices.
    The advantage of our lattice-based model is that lattices help us to generalize
the MDL-based evaluation for various patterns. For example, pattern concept


lattices [4] enable us to address the problem for several patterns with the MDL
principle. The key concept for the generalization is to introduce edit operations
on lattices according to patterns, encode them as edit scripts in edit-tables,
where lattices become the search space in our framework. Lattices help us to
consider the completeness of edit operations as well.


4     A Generalized Model via PSA

We apply our framework to pattern concept lattices. As an example of other lattices, we consider GenSets, a generalized version of itemsets using pattern structures, which were introduced to analyze non-binary data [4]. We call methods using them for Knowledge Discovery Pattern Structure Analysis (PSA) below.


4.1   Pattern Structures and Generalized Sets

Pattern structures extend FCA using meet semi-lattices; a meet semi-lattice (D, ⊓) is a pair of a set D of descriptions representing fragments of objects as their features, and a meet operator ⊓ between descriptions satisfying associativity, commutativity, and idempotency. A partial order ⊑ of descriptions is induced by ⊓ for x, y ∈ D as x ⊑ y whenever x ⊓ y = x. A pattern structure P is a triple P = (G, (D, ⊓), δ), where G is a set of objects, (D, ⊓) is a meet semi-lattice, and a mapping δ : G → D gives a description for each object.
    In PSA we use the following Galois connection: A^□ = ⊓_{g∈A} δ(g) for A ⊆ G, and d^□ = {g ∈ G | d ⊑ δ(g)} for d ∈ D. A pair (A, d) ∈ 2^G × D is a pattern concept if and only if A^□ = d and d^□ = A. A partial order among pattern
concepts is defined as well, and pattern concept lattices can be constructed in a
similar manner of FCA. Recent applications of PSA and lattices constructed by
PSA can be seen in [3, 11]. We consider a slight modification of FCA by giving
additional information to itemsets from the following scenario.

Example 6 (from Example 1). Let us consider t4 = 179 of id i4 and t5 = 2379
of id i5. We have (i4 i5)′ = 79 computed by i4′ ∩ i5′ = 179 ∩ 2379 = 79. The
remainders for each transaction are 1 = 179 \ 79 and 23 = 2379 \ 79, respectively.
In this case, the existence of these items (i.e., 1 in t4 and 2, 3 in t5 ) is ignored
although they are useful for the representation of itemsets.

   From this example, by using a meet semi-lattice, we introduce auxiliary numbers representing the existence of items, i.e., how many further items may be included in common, to enhance the expressiveness of itemsets.

Definition 7 (GenSets). We define D = 2I × N. For an itemset B ⊆ I, we
encode it by a pair (B, 0) ∈ D. The meet operator u is defined as

(B1 , n1 ) u (B2 , n2 ) = (B1 ∩ B2 , min(n1 + |B1 \ B1 ∩ B2 |, n2 + |B2 \ B1 ∩ B2 |)).


We call itemsets with additional numbers GenSets (generalized sets of items) 9 .

Example 7. For two itemsets t4 = 179, t5 = 2379, we encode them as d1 = (179, 0) and d2 = (2379, 0), respectively. Then d = d1 ⊓ d2 is computed as d = (B, n) = (79, 1), where n = 1 means that both itemsets t4 and t5 possibly have one further item. In other words, a GenSet (79, 1) can be regarded as {7, 9, ?} with a wildcard symbol ? representing any item in I.
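The meet operator of Definition 7 can be written down directly; the small sketch below reproduces Example 7 (the set representation is our own choice).

    def genset_meet(d1, d2):
        """(B1, n1) ⊓ (B2, n2) from Definition 7."""
        (B1, n1), (B2, n2) = d1, d2
        B = B1 & B2
        return (B, min(n1 + len(B1 - B), n2 + len(B2 - B)))

    d1 = (frozenset("179"), 0)     # t4 = 179 encoded as (179, 0)
    d2 = (frozenset("2379"), 0)    # t5 = 2379 encoded as (2379, 0)
    print(genset_meet(d1, d2))      # -> (frozenset({'7', '9'}), 1), i.e. the GenSet (79, 1)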

    We can interpret GenSets as a richer representation, and conjecture that they are helpful for capturing characteristic features of itemsets. In other words, we now consider taking care of both the extent and the intent of concepts when evaluating the MDL principle. However, the Krimp algorithm cannot deal with GenSets directly because the symbol ? is not in I and has a completely different meaning. Therefore, the ordinary cover is not suitable and a new formulation would be required from scratch. In contrast, our lattice-based model is applicable. We can evaluate the two-part MDL principle for GenSets by only adapting the cover function cover_L(·, ·) to use edit operations including ?.


4.2    The two-part MDL principle for GenSets

We can evaluate the two-part MDL principle only by considering the function
cover L (·, ·) using edit operations based on pattern concept lattices.


Models for GenSets For a GenSet (B, n) ∈ D, we need to encode n to store
GenSets in a code-table, which is possible by adding a column for storing the code
for n. There exist several codes for numbers (e.g., arithmetic codes). We choose
a simple one, which requires the length − log2 (1/P ) bits where P is the number
of possible candidates, and n is bounded by the minimum size of transactions in
D. Assuming this code, we set P = |I|, the worst length of such representation.


The Cover with GenSets For d = (B, n) ∈ D, if n = 0, d is an itemset B.
We focus on how to define the cover with GenSets when n > 0.

Example 8. Let d = (B, n) = (79, 1) and t3 = 12789. We first cover a part of t3 by the itemset B and get 128 = t3 \ B. Additionally, the symbol ? (n = 1) is rewritten to one of {1, 2, 8}. The item 8 is chosen, and code_ST(8) is used for this choice because the code of 8 in ST is longer than those of 1 and 2.

Please see Fig. 4 as well for the cover of t3 using a GenSet d = (79, 1) ∈ D. As
the symbol ? subsumes any items in I we need to evaluate such subsumptions.
To implement it, we assume that we would like to find general, common features
which characterize many transactions by excluding non-frequent items as long
9 For three sets X, Y, Z ∈ 2^I we want to compute n = min(|X| − |XYZ|, |Y| − |XYZ|, |Z| − |XYZ|), where XYZ = X ∩ Y ∩ Z, and this value n can be computed by counting and set operations.

Fig. 4: For t3 = 12789 with CT = {(79, 1)}, where (79, 1) is a selected GenSet. One more item can be covered by the GenSet (79, 1) and we need to find it besides 79; the lattice is traversed and the item 8, having the longest code, is chosen.



as possible. We implement the subsumption of ? as replacing the symbol ? with
i ∈ I by finding the item i having the longest code ST (i). Instead of giving a
formal description, we give sketches of our idea below.

Sketch 1 We prepare a type SUB and construct an edit script as (SUB, i, ci ).
As we have two types (DEL and SUB), L(code OP (t)) in Equation 3 is used. The
function Encode(·) should be revised with traversing of an item i having the
longest code. For example in Fig.4, cover L ({(79, 1)}, 12789) now should work
like ({(79, 1)}, {(DEL, 1, c1 ), (DEL, 2, c2 ), (SUB, 8, c8 )}) after traversing i = 8.


Sketch 2 For d = (B, n), if n > 0 (i.e., the code of n shows a non-zero value), we use a code code_ST({i}) for representing the item i with the longest code in ST, without edit scripts. This is done by extending the function cover_L(·, ·) into a function cover_L^new such that, for example, cover_L^new({(79, 1)}, 12789) = ({(79, 1)}, {(DEL, 1, c1), (DEL, 2, c2)}, {8}), where {8} can be immediately evaluated by ST computed from D.
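A hedged sketch of the wildcard handling of Sketch 2: after covering with B, the n wildcards absorb the remaining items having the longest codes in ST, and the rest become DEL edit scripts. The tie-breaking and the illustrative code lengths are our assumptions.

    def cover_with_genset(genset, transaction, length_ST):
        """Cover `transaction` with one selected GenSet (B, n) in the spirit of Sketch 2."""
        B, n = genset
        rest = set(transaction) - set(B)
        # items absorbed by '?', chosen to have the longest (least frequent) codes in ST
        by_wildcard = sorted(rest, key=lambda i: length_ST[i], reverse=True)[:n]
        deletions = {("DEL", i) for i in rest - set(by_wildcard)}
        return {B}, deletions, set(by_wildcard)

    # Example 8: t3 = 12789 covered with the GenSet (79, 1); assuming 8 has the longest
    # ST code, the wildcard absorbs 8 and the items 1 and 2 become deletions.
    length_ST = {"1": 2.0, "2": 2.0, "7": 2.3, "8": 3.0, "9": 2.3}   # illustrative lengths
    print(cover_with_genset((frozenset("79"), 1), "12789", length_ST))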


    In dealing with pattern concept lattices, many ways are applicable to fulfill
the loss-less property. Because both sketches are based on our edit-tables for
itemsets in Section 3 and they adopt prefix codes, they achieve the loss-less
property if they have codes for items subsumed by ?. As they have such codes in
fact, the loss-less property holds for GenSets. In experiments we use Sketch 2.

Corollary 1 (The Loss-Less Property of GenSets) The loss-less property
immediately holds for GenSets from discussions above.


    We stress that lattices are compact representations of objects and their fea-
tures, and they help us to prepare edit operations (i.e., edit scripts) and the cover
function. Furthermore, we can introduce more labels in subsumptions of ? for
representing various information to take into account valuable side information
in pattern mining. In addition, we expect that GenSets can be used not only for
the symbol ? but also for representing hierarchy by stacking GenSets as trees.



Table 2: Benchmark datasets represented in the form of contexts (G, M, I). Re-
call that CF is the set of all closed itemsets with θ = 1.
                 Dataset   objects |G|  attributes |M|  |I|/(|G||M|)    |CF|
                 Iris              150              19         0.263     163
                 breast            699              16         0.624     641
                 pima              768              38         0.237   3,204
                 wine              178              68         0.206  14,555



5      Experiments
From the UCI Machine Learning repository [10], we use iris10, breast11, pima12, and wine13, which are also used as benchmark datasets for the Krimp algorithm [18]. All of these datasets are available from the distribution package14 of the Krimp algorithm. The statistics of the datasets are summarized in Table 2.
   All of our experiments were run on a machine with Mac OS X 10.10.2, 2 × 2.26 GHz Quad-Core Xeon processors, and 64GB of main memory. All code was written in Python 2.7.9. Our algorithm was the same as Algorithm 3 in the paper [18], an algorithm that iteratively and greedily updates a code-table, except for the part of the MDL evaluation. In our framework we replaced it with our implementation of Equations 3 and 4 based on lattices and edit operations. We did not study the computational complexity of our framework in depth. Rather than discussing run-times only, we examined the following.
     – Comparison of the two models: the results of the two models coincide by their equivalence (Proposition 1) for itemsets. We then investigate how much computational time our new model adds.
     – We analyze the results with GenSets to check the applicability of our lattice-based model to other patterns.
    Let CT be the code-table obtained after applying the Krimp algorithm. For a database D we can compute both L(D, ST) and L(D, CT). We then measured the ratio L(D, CT)/L(D, ST) (used in [18]), which can be regarded as the degree of compression achieved by CT, denoted Ratio in the results. In addition, we also measured the size |CT⁻| because it indicates how many concepts are chosen out of the whole set of enumerated patterns. Furthermore, in the experiments with GenSets, we measured the Jaccard index J(X, Y) ≡ |X ∩ Y|/|X ∪ Y| for two sets X and Y, between the set of identifiers of closed itemsets CF and that of GenSets, to see the difference between the results by itemsets and those by GenSets. Note that in our setting, the set of pattern concepts by GenSets can be regarded as a set of closed itemsets if we ignore the additional numbers n in (B, n) ∈ D.
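For reference, the two evaluation measures used below can be computed as follows (a trivial sketch; the identifier sets passed to the Jaccard index are whatever concept identifiers are recorded for the two runs).

    def compression_ratio(L_D_CT, L_D_ST):
        """Ratio = L(D, CT) / L(D, ST); smaller means better compression."""
        return L_D_CT / L_D_ST

    def jaccard(X, Y):
        """J(X, Y) = |X ∩ Y| / |X ∪ Y| between two sets of concept identifiers."""
        X, Y = set(X), set(Y)
        return len(X & Y) / len(X | Y) if X | Y else 1.0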
10 https://archive.ics.uci.edu/ml/datasets/Iris
11 https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
12 https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
13 https://archive.ics.uci.edu/ml/datasets/Wine
14 http://www.patternsthatmatter.org/software.php, confirmed Feb. 2015.



Table 3: Results for itemsets by code-tables (the Krimp algorithm) and two-
tables (concepts and edit-tables in our framework).
                                        Krimp                       I-Edit
             |CF| |CT − | Ratio Pre-proc. [s] Run-time [s] Pre-proc. [s] Run-time [s]
     iris     163     13   0.456       0.025        0.234         0.021       0.520
     breast   641     29   0.181       0.580        6.263         0.573      26.794
     pima   3,204     66   0.356      18.202       47.525        19.107      373.63
     wine 14,555      72   0.757     442.686      393.539       444.842    1354.570



    In the results, we label by Krimp the results of the Krimp algorithm, by I-Edit the results of our framework using lattices and edit operations, and by Gen-Edit the results for GenSets using pattern concept lattices.


5.1     Results and Discussions for Itemsets

We show in Table 3 the results of the experiments for itemsets based on the existing method Krimp and our framework I-Edit, where we focus on run-times only. Note that we, of course, confirmed that the results are equivalent.
     With respect to the pre-processing time, there were no large differences. Concerning the time for the iterative model-updating process, roughly speaking our new model is at most 10 times slower than code-tables. Because neither model is implemented in a sophisticated manner, we conjecture that our implementation does not exploit pre-processed information on lattices well; that is, computing edit-tables using lattices is time-consuming in our current implementation. Because the edit-operation part clearly depends on the implementation of lattices and the data structures for concepts, we expect that advanced data structures would help us to overcome this bottleneck.


5.2     Results and Discussions for GenSets

We conducted the same experiments on the same datasets with GenSets (Section 4.1) and our MDL evaluation (Section 4.2). The results are summarized in Table 4. We can see that the numbers of selected concepts via GenSets (i.e., 12, 24, 62, and 40) are lower than those via itemsets (i.e., 13, 29, 66, and 72), which means that the models obtained via GenSets are more compact than those obtained via itemsets. The same trend can be confirmed in the Ratio column.
    The run-times of the experiments are shorter than those in Table 3. Because the numbers of concepts (itemsets) and pattern concepts (GenSets) are the same, these results should arise from the differences in the cover and the MDL evaluation. In addition, the contents of the selected patterns are affected. We can see the difference in the contents of the selection by checking the Jaccard index: our GenSet formulation clearly generates quite different results compared with those for itemsets.



         Table 4: Results by our lattice-based framework for GenSets.
                                           Gen-Edit
                 |CF| |CT − | Ratio Pre-proc. [s] Run-time [s] Jaccard Index
       iris     163       12   0.428       0.009       0.550          0.785
       breast   641       24   0.168       0.351      20.666          0.293
       pima   3,204       62   0.342       18.92     285.676          0.123
       wine 14,555        40   0.686     415.376     454.631          0.131



    Furthermore, the existence of the stars ? in GenSets affects the results. For GenSets (in Section 4.2), we aggressively cover items having longer codes in ST whenever the symbol ? appears. As less frequent items have longer codes in the table ST and such items are covered by ? in advance, the concepts stored in our model have relatively high frequency in the database, and the results become more compact.

Concluding Remarks on Experiments, and Future Directions We conducted MDL-based pattern summarization on selected benchmark datasets using code-tables and our model (two tables). We also applied our model to GenSets, to which the same framework applies. At least we confirmed that our model obtains more compact results using GenSets (checked by Ratio), and that the contents are also different (checked by the Jaccard index).
    The quality of the choice of concepts via the two-part MDL principle always remains open to discussion. To discuss this quality, the previous work [18] developed a classifier using code-tables, tested it on labeled data, and discussed the quality of their choice. Therefore, following their strategy and developing a classifier using edit operations and lattices is future work for assessing the pros and cons of our model and the quality of its choices.
    Further discussion of the relation between run-times for edit-tables and properties of lattices (e.g., the density of contexts) is also important. Because the order of checking the set CF is optional, developing and analyzing further heuristics based on lattices is also important. We expect that several notions of lattice theory (not used in this paper) and methods for lattices would be helpful in this direction.


6   Conclusions and Future Work
In this paper we propose a new framework for pattern summarization to con-
quer the redundancy problem for various patterns by combining lattices and
edit operations, which are evaluated by the two-part MDL principle. Our ex-
periments show that our model successfully extends an existing model towards
more general patterns. As an example we confirm that our model is applicable
to GenSets, defined by pattern structures as a richer representation of itemsets.
For closed itemsets, our model achieves the same results as the existing model
(Proposition 1). As the Ratio results in our experiments show, GenSets are useful for obtaining more compact representations of databases compared with the existing model.


    In our future work, we plan to develop a classifier based on the MDL principle by following the existing work, in order to discuss the quality of the selection. We will also investigate encoding methods and heuristics for the cover towards various patterns together with side information. A theoretical analysis of the two-part MDL evaluation via lattices is also an important future research direction.

Acknowledgements. The authors are grateful to the anonymous reviewers
for valuable comments. This study was partially supported by Grant-in-Aid for
JSPS Fellows (26-4555) and JSPS KAKENHI Grant Number 26280085.


References
 1. Aggarwal, C.C., Han, J. (eds.): Frequent pattern mining. Springer International
    Publishing (2014)
 2. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large
    databases. In: Proc. of the 20th VLDB. pp. 487–499 (1994)
 3. Buzmakov, A., Egho, E., Jay, N., Kuznetsov, S.O., Napoli, A., Raïssi, C.: The
    representation of sequential patterns and their projections within formal concept
    analysis. In: Workshop Notes for LML (ECML/PKDD2013) (2013)
 4. Ganter, B., Kuznetsov, S.O.: Pattern structures and their projections. In: Proc. of
    the 9th ICCS. pp. 129–142 (2001)
 5. Ganter, B., Wille, R.: Formal concept analysis - mathematical foundations.
    Springer (1999)
 6. Geerts, F., Goethals, B., Mielikäinen, T.: Tiling databases. In: Proc. of the 7th
    DS. pp. 278–289 (2004)
 7. Grünwald, P.D.: The minimum description length principle. The MIT Press (2007)
 8. Koutra, D., Kang, U., Vreeken, J., Faloutsos, C.: VOG: Summarizing and under-
    standing large graphs. In: Proc. of the 22nd SDM. pp. 91–99 (2014)
 9. Li, M., Vitányi, P.M.: An introduction to Kolmogorov complexity and its applications. Springer Publishing Company, Incorporated, 3rd edn. (2008)
10. Lichman, M.: UCI machine learning repository (2013), http://archive.ics.uci.edu/ml
11. Otaki, K., Ikeda, M., Yamamoto, A.: Pattern structures for understanding episode
    patterns. In: Proc. of the 11th CLA. pp. 47–58 (2014)
12. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed itemsets for association rules. In: Proc. of the 7th ICDT. pp. 398–416 (1999)
13. Pei, J., Han, J., Mao, R.: CLOSET: An efficient algorithm for mining frequent
    closed itemsets. In: Proc. of the DMKD2000. pp. 21–30 (2000)
14. Smets, K., Vreeken, J.: Slim: directly mining descriptive patterns. In: Proceedings
    of the 20th SDM. pp. 236–247 (2012)
15. Tatti, N., Vreeken, J.: Discovering descriptive tile trees. In: Proc. of the
    ECML/PKDD 2012. pp. 24–28 (2012)
16. Tatti, N., Vreeken, J.: The long and the short of it: Summarising event sequences
    with serial episodes. In: Proc. of the 18th KDD. pp. 462–470 (2012)
17. Uno, T., Asai, T., Uchida, Y., Arimura, H.: LCM: An efficient algorithm for enu-
    merating frequent closed item sets. In: Proc. of the FIMI’03 (2003)
18. Vreeken, J., van Leeuwen, M., Siebes, A.: Krimp: Mining itemsets that compress.
    Data Mining and Knowledge Discovery 23(1), 169–214 (2011)
          An Approach Towards Classifying and
          Navigating RDF data based on Pattern
                       Structures

                         Mehwish Alam and Amedeo Napoli

                        LORIA (CNRS – Université de Lorraine)
                      BP 239, 54506 Vandoeuvre-les-Nancy, France
                       {mehwish.alam,amedeo.napoli@loria.fr}



         Abstract. With an increased interest in machine-processable data, more and more data is now published in RDF (Resource Description Framework) format. This RDF data resides in independent and distributed resources which need to be centralized, navigated and searched for domain-specific applications. This paper proposes a new approach based on Formal Concept Analysis (FCA) to create a navigation space over Semantic Web data. The approach uses an extension of FCA, takes RDF triples and RDF Schema present in several independent sources, and provides centralized access to the data coming from these resources. Afterwards, SPARQL queries can be posed over this navigation space to access the distributed resources from one platform for information retrieval purposes.


  Keywords: Formal Concept Analysis, Navigation Space, Linked Open Data,
  RDF, RDF Schema, Semantic Pattern Structures.


  1    Introduction
   With the efforts of the Semantic Web community, many technologies have been offered for publishing machine-readable data on the web. Textual data is annotated with meta-data and made available in the form of ontologies and RDF graphs. One of the emerging sources of such data is the Linked Open Data (LOD) cloud [2]. In contrast to textual resources, LOD does not need extensive preprocessing as it is already annotated in the form of entities and relationships.
    This structured format leads to other kinds of challenges. One of the basic characteristics of the Semantic Web is that it follows a decentralized publication model: RDF graphs are published in several distributed resources and, instead of creating one knowledge base of statements, anyone can contribute new statements and make them publicly available. These resources have nothing in common except some shared terms. These decentralized graphs should be integrated through machine/software agents to provide domain-specific applications. Moreover, external schemas in the form of ontologies or taxonomies



can be linked to these data to make sense of them based on a real-world conception. Some resources may only consist of ontological information and not contain instance-level information, such as the SWRC ontology [11], while other resources may only contain instance-level information, such as DBLP. The problem of providing one platform for accessing several related Semantic Web resources in order to build domain-specific applications still persists. This paper focuses on how to link several related heterogeneous RDF resources, present at distributed locations and containing RDF data and RDF Schema, into one space which can further be navigated and searched by the end user. We call it a “navigation space” because it provides centralization over RDF triples and also over the RDF Schema belonging to several resources. This space provides navigation and querying over RDF triples as well as RDF Schema.
    The current paper introduces a new framework based on Formal Concept Analysis [8] which focuses on how a navigation space is created over RDF data by taking into account an existing RDF Schema, so as to provide search capabilities over distributed and heterogeneous Semantic Web resources. In the current study we only consider the part of RDF Schema that keeps the subclass-superclass relation. Other constructs such as sub-properties and relations between classes are not considered here. FCA alone may not be able to handle complex data such as RDF data and RDF Schema, as it considers only binary data. To deal with this shortcoming, we employ an extension of FCA, namely Pattern Structures [7]. We propose a new framework called Semantic Pattern Structures, which takes RDF triples and the RDF Schema as an input and produces a “navigation space”. This navigation space provides centralization over distributed RDF resources and keeps a partially ordered organization of the classes of RDF triples with respect to the RDF Schema, from data sources which may or may not be part of LOD. The lattice structure allows for querying over the RDF graph w.r.t. subject, predicate and object and makes use of the partial order relation between the concepts. This enables simultaneous search over the schema and the facts contained in an RDF graph. The navigation space allows querying using the well-documented and well-known query language SPARQL. Several data sources are simultaneously queried through one platform. To date this is the first attempt to deal with RDF graphs and RDF Schema using pattern structures.
    The paper is structured as follows: Section 2 gives the motivating example, and Section 3 introduces the basic notions of Pattern Structures with an example. Section 4 details the framework of Semantic Pattern Structures for creating a schema-rich navigation space over RDF data. Section 5 defines the querying mechanism over the obtained navigation space. Section 6 describes the experimental results of the framework. Section 7 discusses the related work, while Section 8 concludes the paper.


2    Motivating Example

For instance, suppose a patient has been prescribed some Cardiovascular Agent (CVA) and, after a while, is looking for a replacement for this drug because it causes some allergic condition as a side effect. To find a solution, the doctor should be able to access information suggesting a replacement drug which is used for the same purpose but does not cause the side effect reported by the patient. All the required information is present in different data sources:
 – Drugbank1 contains information about drugs with their categories and the
   proteins it targets.
 – SIDER2 contains drugs with their specific side effects.
 – The drug category CVA exists in MeSH3.
 – Allergic conditions, which is a class of side effects, exists in MedDRA4.
The information required by the domain expert cannot be obtained by posing a simple query on Google. SPARQL queries over each resource would have to be written to obtain this answer, which further requires manual work to obtain the required suggestions, as the answers are in the form of lists. In order to access such information, all the data sets must be accessed simultaneously and there should be a single information space for accessing all the desired information through navigation and simple SPARQL queries.
    In the current study, we propose a generic framework that provides one space for accessing several data sources related to some domain. The framework is generic because it can be applied to any domain having related information scattered over several resources. Based on the requirements of the user and the applications, the navigation space can be defined on a limited number of factual RDF triples. For example, a cardiologist will only be interested in Cardiovascular Agents or some related categories of drugs. In this case the application will include the drugs which are Cardiovascular Agents and the related categories. Only a subset of the datasets should be considered according to the requirements of the applications. These specifications are described by the domain expert as a starting point for narrowing down the huge cloud of Linked Data present on-line.


3     Preliminaries
3.1   Linked Open Data
Recently, Linked Open Data (LOD) [2] has become a standard for publishing data on-line in the form of the Resource Description Framework (RDF)5, which can further be linked to other data sources published in the same format. The idea underlying RDF is based on storing statements about a particular resource, where each statement is represented as ⟨subject, predicate, object⟩. A
1 http://www.drugbank.ca/
2 http://sideeffects.embl.de/
3 http://www.ncbi.nlm.nih.gov/mesh/
4 http://meddra.org/
5 http://www.w3.org/RDF/


set of linked statements is referred to as an RDF graph, which constitutes an RDF triple store. For instance, Table 2 keeps RDF triples for drugs with their side effects and drug categories. The prefixes and full forms of all the abbreviations used in this paper are shown in Table 1. Consider t1, i.e., ⟨s1, p1, o1⟩; here s1 is the subject, p1 is the predicate and o1 is the object. A URI U in an RDF graph is a general form of URL. The actual representation of the drug s1 in the form of a URI is http://wifo5-04.informatik.uni-mannheim.de/drugbank/resource/drugs/DB00669, which is a URI of the DrugBank provided by the University of Mannheim containing all the information about the drug in the form of triples. DB00669 is the id provided by DrugBank to drug s1, and the name of the drug is obtained by using the rdfs:label predicate defined in the RDFS vocabulary6. The subject denotes the resource, and the predicate denotes properties of the resource and defines the relationship between the subject and the object. In the rest of the paper we use dereferenced resources, i.e., s1 instead of the complete URI.
    Practically, we consider RDF data w.r.t. three levels according to the value
of predicate.

 – An RDF triple is at the schema level when the predicate has a reserved
   value from RDFS vocabulary such as rdfs:subClassOf, rdfs:Class (rep-
   resented as sc and Class respectively in the rest of the paper). The keyword
   rdfs:Class is used to declare that a set of resources is constituting a class
   and rdfs:subClassOf states that all the instances of one class are the in-
   stances of another class. For example, in Table 2, the schema level is given by
   triples t6, t8. An example of the tree structure contained in an RDF schema
   (in the rest of the paper when we use RDF Schema/schema we mean the
   tree structure created by rdfs:subClassOf) related to the drug side effect
   and their categories is shown in Figures 1 and 2 respectively.
 – An RDF triple is at the type level if it keeps the mappings from an instance
   to its class and states that a resource is an instance of a class. The RDF
   triple is at this level when the predicate has a reserved value from RDF
   vocabulary such as rdf:type (represented as type in the rest of the paper).
   For example, in Table 2, the type level is given by triples t4, t5, t7.
 – An RDF triple is at factual level when the predicate is related to the domain
   knowledge and is not a keyword from RDF vocabulary. For example, in
   Table 2, the factual level is given by t1, t2, t3.

    Table 2 keeps RDF triples from several resources. The information about the
side effect of the drugs shown in triples t1, t2 is present on SIDER7 database
which keeps the side effects of the drugs contained on the packaging of the drug.
The information about the categories of the drugs shown in triples t3 is present
on DrugBank8 , which keeps detailed information about each of the drugs and
the proteins it targets. The schema level information about the drug side effects
shown in t5, t6 are present in The Medical Dictionary for Regulatory Activities
6 http://www.w3.org/TR/rdf-schema/
7 http://sideeffects.embl.de/
8 http://www.drugbank.ca/

 Abbreviation  Term
 p1   http://wifo5-04.informatik.uni-mannheim.de/sider/resource/sider/sideEffect
 p2   http://wifo5-04.informatik.uni-mannheim.de/drugbank/resource/drugbank/category
 s1   Sumatriptan                  s2   Tadalafil
 s3   Theophylline                 s4   Ticlopidine
 s5   Vardenafil
 D1   Fibrinolytic Agents          D2   Vasoconstrictor Agents
 D3   Vasodilator Agents           D4   Fibrin Modulating Agents
 D5   Cardiovascular Agents        D6   Molecular Mechanisms of Pharmacological Action
 D7   Therapeutic Uses
 C1   Acute Coronary Syndrome      C2   Tachycardia
 C3   Urticaria                    C4   Setevens-Johnsons Disorders
 C5   Erythmia Multiforme          C6   Coronary Artery Disorders
 C7   Cardiac Arrhythmias          C8   Urticarias
 C9   Allergic conditions NEC      C10  Allergic conditions
 C11  Cardiac Disorders            C12  Immune System Disorders

        Table 1: This table keeps the abbreviations of the terms used in the rest of the paper.

                                   tid Subject Predicate Object Provenance
                                   t1 s1       p1        o1     SIDER
                                   t2 s1       p1        C2     SIDER
                                   t3 s1       p2        D2     DrugBank
                                    t4 s1       type      Drug   DrugBank
                                   t5 o1       type      C1     MedDRA
                                   t6 C1       sc        C6     MedDRA
                                   t7 D2       type      D5     MeSH
                                   t8 D5       sc        D7     MeSH
                                   ... ...     ...       ...    ...

                  Table 2: RDF triples for the Drug Domain from several resources.

(MedDRA) Terminology9, which is the international medical terminology. Finally, the schema level information related to the drug category, shown in triples t7, t8, is present in MeSH (Medical Subject Headings)10, a controlled vocabulary thesaurus which, along with other tree structures, keeps a classification of drug categories.

3.2     SPARQL
A standard query language for accessing RDF graphs is SPARQL11, which mainly focuses on graph matching. A SPARQL query is composed of two parts: the head and the body. The body of the query contains the Basic Graph Patterns (contained in the WHERE clause of the query). It is composed of complex graph patterns that may include RDF triples with variables, conjunctions, disjunctions and constraints over the values of the variables. These graph patterns are matched against the RDF graph, and the matched graph is retrieved and manipulated according to the conditions given in the query. The head of the query is an expression which indicates how the answers of the query should be constructed. For instance, consider a simple SPARQL query SELECT ?x ?y WHERE { ?x p1 ?y }; the basic graph pattern ?x p1 ?y will be matched against the triples containing p1 as predicate, i.e., t1 and t2 (see Table 2). Thus the answer of the query will be (s1, o1), (s1, C2).
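As a minimal, hedged illustration of how such a query could be executed programmatically (using the Python rdflib library and a made-up example.org namespace instead of the real DrugBank/SIDER URIs), the toy query above can be run against an in-memory graph holding t1 and t2:

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")     # placeholder namespace, not the real URIs
    g = Graph()
    g.add((EX.s1, EX.p1, EX.o1))              # t1: <s1, p1, o1>
    g.add((EX.s1, EX.p1, EX.C2))              # t2: <s1, p1, C2>

    q = "SELECT ?x ?y WHERE { ?x <http://example.org/p1> ?y }"
    for row in g.query(q):
        print(row.x, row.y)                   # prints the pairs (s1, o1) and (s1, C2)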
9 http://www.meddra.org/
10 http://www.ncbi.nlm.nih.gov/mesh
11 http://www.w3.org/TR/rdf-sparql-query/


3.3     Pattern Structures
Formal Concept Analysis [8] can process only binary contexts; more complex data such as graphs or RDF triples cannot be directly processed by FCA. Some kinds of complex data require scaling to be treated as binary data, and the concept lattice obtained from such a binary context mixes several types of attributes. Pattern structures [7], an extension of FCA, allow direct processing of complex data such as RDF triples (as we show in this paper). Pattern structures were introduced in [7].
    A pattern structure is a triple (G, (D, ⊓), δ), where G is the set of entities12, (D, ⊓) is a meet-semilattice of descriptions D, and δ : G → D maps an entity to a description. More intuitively, a pattern structure is a set of objects with their descriptions together with a similarity operation ⊓ on them which represents the similarity of objects. This similarity operation is idempotent, commutative and associative. If (G, (D, ⊓), δ) is a pattern structure, then the derivation operators can be defined as:

    A^□ := ⊓_{g ∈ A} δ(g)   for A ⊆ G,
    d^□ := {g ∈ G | d ⊑ δ(g)}   for d ∈ D.

Now the pattern concept can be defined as follows:

Definition 1 (Pattern Concept). A pattern concept of a pattern structure (G, (D, ⊓), δ) is a pair (A, d) where A ⊆ G and d ∈ D such that A^□ = d and A = d^□, where A is called the concept extent and d is called the concept intent.

    Finally, the obtained concept lattice is called a pattern concept lattice, which keeps partially ordered relations between the pattern concepts.
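A small sketch of the two derivation operators, assuming the simplest meet-semilattice in which descriptions are sets and ⊓ is set intersection (so d ⊑ δ(g) is the subset test); names and data layout are ours.

    from functools import reduce

    def extent_to_description(A, delta):
        """A^□ = ⊓_{g∈A} δ(g): the common description of the entities in A."""
        return reduce(lambda x, y: x & y, (delta[g] for g in A))

    def description_to_extent(d, G, delta):
        """d^□ = {g ∈ G : d ⊑ δ(g)}: all entities whose description subsumes d."""
        return {g for g in G if d <= delta[g]}

    # (A, d) is a pattern concept iff A^□ = d and d^□ = A.
    delta = {"g1": frozenset("ABC"), "g2": frozenset("AB"), "g3": frozenset("AC")}
    d = extent_to_description({"g1", "g2"}, delta)            # frozenset({'A', 'B'})
    print(d, description_to_extent(d, delta.keys(), delta))   # -> {'A','B'} {'g1','g2'}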


4      Towards Semantic Pattern Structures
The decentralization issue arises when many parties contribute statements and make them publicly available: either they publish the same data using different vocabularies, or related data that is distributed over several resources. To provide centralized access over distributed resources, this section discusses a framework called Semantic Pattern Structures. This new framework is called Semantic Pattern Structures (SPS) because it is based on Pattern Structures; it deals with data in the form of triples and classifies these triples with respect to an RDF Schema present in the form of a taxonomy. This way the final navigation space also keeps the semantic information, i.e., the background knowledge. The idea underlying SPS is to create a concept lattice directly from an RDF graph and use the associated RDF Schema present on Linked Open Data to
12 We rename an object in Pattern Structures as an entity to avoid confusion with the object in an RDF triple.


be accessed within one concept lattice. This concept lattice provides search and navigation capabilities over RDF data as well as over the RDF Schema, to answer certain queries of the domain expert. The obtained lattice not only provides access to several data sources but is also a materialization of the associated RDF Schema.
    For creating one navigation space over RDF as well as RDF Schema, the first task is to define RDF triples in terms of pattern structures, i.e., specifying the entities, their descriptions and the mapping from entities to descriptions. The second task is to select a suitable RDF Schema associated to these RDF triples from distributed data sources based on domain knowledge. After these two preprocessing steps, we define a similarity measure ⊓ over two descriptions which we generalize to sets of descriptions. After defining the similarity measure, we explain how a semantic pattern concept lattice is built using it. Finally, we interpret, and provide navigation and querying capabilities over, the obtained pattern concept lattice.


4.1   From RDF Triples to Semantic Pattern Structures

Firstly, we represent factual RDF triples about certain domain as entities and
their descriptions. According to section 3.3, pattern structures are given as
pG, pD, [q, δq. The descriptions D are termed as semantic descriptions and are
denoted as Ds . An RDF triple store containing factual level information consists
of statements of the form xsi , pj , ok y, where i “ t1, . . . , nu, j “ t1, . . . , qu, k “
t1, . . . , mu. Then, the set of subjects si are mapped to the entities in G. As the
set of entities G represent the subjects in triples, we represent it as S.
    Regarding the type of each object set, which is the range of the same predicate,
a suitable RDF Schema is obtained by locating the terms in the closely related
RDF Schema. This selection is done based on domain knowledge. RDF
Schema contains many constructs such as property, sub-property, etc. along with
the sc (subclass) information, but in this work we only use the RDF Schema predicates
type and sc introduced in Section 3.1. First, the mapping from an object to
its type is obtained, such as ⟨ok, type, C1⟩, i.e., ok is an instance of class C1.
Each object is then replaced by its corresponding class, i.e., ok is replaced by
C1. The taxonomy information used for defining the similarity
measure is present in the RDF Schema and is obtained by following predicates such as
rdfs:subClassOf and skos:broader, e.g., ⟨C1, sc, C6⟩, i.e., C1 is a subclass of C6.
Note that we do not extract the trees; we follow this predicate while defining
the similarity in the next section.
    For instance, to obtain generic drug categories such as cardiovascular agents,
or generic side effects such as allergic conditions, we further add the class infor-
mation related to each of the objects in the triples. For example, ⟨ok, type, C2⟩
means that ok is an instance of class C2. Afterwards, we select all the super-
classes of the current class in the RDF Schema; for example, for C2 we have
C7, C11 and C13. The RDF Schemas for drug side effects and drug categories are shown
in Figures 1 and 2, respectively.

Fig. 1: RDF Schema for Side Effects (MedDRA) (T1), a taxonomy over the classes C1–C13. Fig. 2: RDF Schema for Drug Category (MeSH) (T2), a taxonomy over the classes D1–D9.


    The pair (pj : Ck) gives a description dij ∈ Ds. The mapping of entities to
descriptions δ : S → Ds is given as follows: let si ∈ S, then δ(si) = {di1, . . . , diq}
where i ∈ {1, . . . , n} and dij = (pj : {C1, C4, . . . , Cm}) where j ∈ {1, . . . , q}.
Finally, the semantic pattern structure is given as (S, (Ds, ⊓), δ). Consider the
running scenario described before. As the drugs in DrugBank and SIDER are
the same and are the subjects, the triples t1, t2, t3 in Table 2 are represented as
(S, (Ds, ⊓), δ), where the entity s1 has the description δ(s1) = {(p1 : {C1, C2, C3}),
(p2 : {D2})}. The descriptions are based on the same subjects having different
descriptions belonging to different resources, i.e., side effect and category. Finally,
with respect to the triples in Table 2, the subjects in the RDF triples are mapped
to the entities S = {s1, s2, . . . } and the descriptions are given as Ds = {(p1 :
{C1, C2, C3, . . . }), (p2 : {D1, D2, . . . })}.


                     Entities S   Descriptions Ds
                     s1           {(p1 : {C1, C2, C3}), (p2 : {D2})}
                     s2           {(p1 : {C1, C4, C5}), (p2 : {D3})}
                     s3           {(p1 : {C2}), (p2 : {D3})}
                     s4           {(p1 : {C3, C4, C5}), (p2 : {D8})}
                     s5           {(p1 : {C1, C3}), (p2 : {D2})}
                     Table 3: RDF triples as entities S and semantic descriptions Ds.
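
As a minimal illustration of this representation, the following Python sketch builds such entity-description pairs from a handful of hypothetical triples and type assertions (the names si, p1, p2 and the classes only mimic the running example; they are not taken from DrugBank or SIDER):

# Hypothetical factual triples and (object, type, class) assertions.
triples = [
    ("s1", "p1", "o1"), ("s1", "p1", "o2"), ("s1", "p1", "o3"),
    ("s1", "p2", "o4"),
]
type_of = {"o1": "C1", "o2": "C2", "o3": "C3", "o4": "D2"}

def to_semantic_descriptions(triples, type_of):
    # Map each subject si to delta(si) = {pj : set of classes},
    # replacing every object ok by the class C given by an (ok, type, C) triple.
    delta = {}
    for s, p, o in triples:
        delta.setdefault(s, {}).setdefault(p, set()).add(type_of[o])
    return delta

print(to_semantic_descriptions(triples, type_of))
# {'s1': {'p1': {'C1', 'C2', 'C3'}, 'p2': {'D2'}}}  (the row s1 of Table 3)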


4.2   Similarity Operation (⊓) over Descriptions

For defining the similarity operation over the semantic descriptions Ds, we use
the notion of least common subsumer (lcs). Here we restate the definition of a
least common subsumer of two classes in an RDF Schema T given in [9].

Definition 2 (Least Common Subsumer). Given a partially ordered set
(S, ≤), a least common subsumer E of two classes C and D (lcs(C, D) for short)
is a class such that C ⊑ E and D ⊑ E, and E is least,
i.e., if there is a class E′ such that C ⊑ E′ and D ⊑ E′, then E ⊑ E′.

Example 1. Let us consider C3 and C5. Then according to the RDF Schema in
Figure 1, the common subsumers of C3 and C5 are C10, C12, C13, and
lcs(C3, C5) = C10.


Similarity Operation over Sets of Descriptions: For describing the similar-
ity over sets of descriptions, the lcs is computed pairwise for the classes in two different
sets of descriptions and then only the most specific classes are kept.
     Let s1 and s2 be two entities such that s1, s2 ∈ S and the mapping δ is given
as follows: δ(s1) = d1 = {p1 : {C1, C2, C3}} and δ(s2) = d2 = {p1 : {C1, C4, C5}}.
The similarity is always computed w.r.t. the same predicate, here p1. A dual
similarity measure is defined to ensure classification of triples with respect to
objects as well as with respect to classes from the RDF Schema. The similarity
measure is computed based on the taxonomy given in Figure 1. According to
the current approach, a pairwise similarity measure is computed between the two
sets of classes connected through the same predicate. For example, δ(s1) ⊓ δ(s2) =
⟨p1 : {C1, C2, C3}⟩ ⊓ ⟨p1 : {C1, C4, C5}⟩, meaning that δ(s1) ⊓ δ(s2) = p1 :
⟨lcs(C1, C1), lcs(C1, C5), . . . ⟩. Then, δ(s1) ⊓ δ(s2) = p1 : {C1, C10, C11, C13}. As
we have C1 ≤ C11 ≤ C13 and C10 ≤ C13, we pick only the most specific classes,
i.e., C1 and C10. Finally, δ(s1) ⊓ δ(s2) = p1 : {C1, C10}.
     Now let us consider the complete descriptions of the two entities, i.e., δ(s1) =
{(p1 : {C1, C2, C3}), (p2 : {D2})} and δ(s2) = {(p1 : {C1, C4, C5}), (p2 : {D3})}.
Then, δ(s1) ⊓ δ(s2) = {(p1 : {C1, C2, C3}), (p2 : {D2})} ⊓ {(p1 : {C1, C4, C5}),
(p2 : {D3})}. The set of least common subsumers for p1 will be {C1, C10}. The
least common subsumer between p2 : {D2} and p2 : {D3} is p2 : {D5}. Finally,
the similarity between these two drugs will be {(p1 : {C1, C10}), (p2 : {D5})}.
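
The computation just described can be sketched in a few lines of Python. The taxonomy below is only a toy fragment chosen to be consistent with the relations mentioned in the text (C1 below C11 below C13, C10 below C13, lcs(C3, C5) = C10, lcs(D2, D3) = D5); it is not the exact MedDRA/MeSH schema of Figures 1 and 2:

# Toy taxonomy: child -> parent edges (illustrative, not the exact Figs. 1-2).
PARENT = {
    "C1": "C6", "C6": "C11", "C2": "C7", "C7": "C11", "C11": "C13",
    "C3": "C10", "C4": "C10", "C5": "C10", "C10": "C12", "C12": "C13",
    "D2": "D5", "D3": "D5", "D5": "D8", "D8": "D9",
}

def ancestors(c):
    # c together with all its superclasses, following the sc edges upwards.
    out = [c]
    while c in PARENT:
        c = PARENT[c]
        out.append(c)
    return out

def lcs(c, d):
    # Least common subsumer: first ancestor of c that also subsumes d.
    common = set(ancestors(d))
    return next(a for a in ancestors(c) if a in common)

def minimal(classes):
    # Keep only the most specific classes (drop anything that subsumes another).
    return {c for c in classes
            if not any(c != d and c in ancestors(d) for d in classes)}

def meet(d1, d2):
    # Similarity d1 meet d2: pairwise lcs per shared predicate, then minimal classes.
    return {p: minimal({lcs(c, d) for c in d1[p] for d in d2[p]})
            for p in d1.keys() & d2.keys()}

d_s1 = {"p1": {"C1", "C2", "C3"}, "p2": {"D2"}}
d_s2 = {"p1": {"C1", "C4", "C5"}, "p2": {"D3"}}
print(meet(d_s1, d_s2))   # p1 -> {C1, C10}, p2 -> {D5}, as in the text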
     The proposed similarity measure fulfills the properties of a similarity operator,
i.e., it is commutative, associative and idempotent [7]. The number of RDF Schemas
considered is equal to the number of predicates in the entity-description representa-
tion of an RDF graph. In the running example, we use two schemas to show that
our method can also easily be applied to multiple RDF Schemas which are either
independent or parts of the same schema distributed over several resources.


4.3    Building the Semantic Pattern Concept Lattice

For building the semantic pattern concept lattice using the above similarity
measure over several RDF Schemas, consider the running example: δ(s1) ⊓
δ(s2) = {(p1 : {C1, C10}), (p2 : {D5})}. The semantic pattern concept lattice is
built according to Definition 1. For instance, {s1, s2}□ = {(p1 : {C1, C10}), (p2 :
{D5})} but {(p1 : {C1, C10}), (p2 : {D5})}□ = {s1, s2, s5}. This way we obtain
K#2 (see Figure 3) with 3 drugs, giving the concept a support of 3. Finally,
the pattern concept lattice is built for Table 3 with respect to T1 (Figure 1) and
T2 (Figure 2). This pattern concept lattice is shown in Figure 3 and the details
of each pattern concept are given in Table 4.

                          Fig. 3: Pattern concept lattice for DrugBank (nodes K#0–K#13)


             K#ID  Extent                   Intent
             K#1   {s1}                     (p1 : {C1, C2, C3}), (p2 : {D2})
             K#2   {s1, s2, s5}             (p1 : {C1, C10}), (p2 : {D5})
             K#3   {s2}                     (p1 : {C1, C4, C5}), (p2 : {D3})
             K#4   {s1, s2, s3, s5}         (p1 : {C11}), (p2 : {D5})
             K#5   {s1, s3}                 (p1 : {C2}), (p2 : {D5})
             K#6   {s2, s3}                 (p1 : {C11}), (p2 : {D3})
             K#7   {s3}                     (p1 : {C2}), (p2 : {D3})
             K#8   {s1, s2, s3, s4, s5}     (p1 : {C13}), (p2 : {D8})
             K#9   {s1, s2, s4, s5}         (p1 : {C10}), (p2 : {D8})
             K#10  {s1, s4, s5}             (p1 : {C3}), (p2 : {D8})
             K#11  {s2, s4}                 (p1 : {C4, C5}), (p2 : {D8})
             K#12  {s4}                     (p1 : {C3, C4, C5}), (p2 : {D9})
             K#13  {s1, s5}                 (p1 : {C1, C3}), (p2 : {D2})
                    Table 4: Details of the pattern concepts of the lattice in Figure 3




4.4    Interpreting and Navigating the Pattern Concept Lattice

This pattern concept lattice not only serves as a navigation space but also pro-
vides a clustering of RDF triples w.r.t. some schema. Each of the pattern concepts
is a cluster of triples. For instance, consider the simplest example in Table 4, i.e.,
K#1, where there is only one entity in the extent: this cluster contains the
triples {(s1, p1, C1), (s1, p2, D2), . . . }. Here it can be seen that all the classes are
connected to a single subject s1 through two predicates. This space can further be
navigated to provide the required information to a user without a technical
background in the Semantic Web. If the user is looking for drugs
causing the side effect C10, the located pattern concept is K#9, which contains
all the drugs causing some allergic condition. Afterwards, if the user is looking
for a very specific allergic condition such as C3, he can simply navigate
downwards to obtain a more specific concept, i.e., K#10, which contains all the
drugs which cause the side effect C3. Let us consider the scenario described in
Section 2, where the user is looking for cardiovascular agents (D5) which do
not cause any allergic conditions. In this case, first the concept containing all
the cardiovascular agents is located, i.e., K#4; afterwards the lattice is navigated
further downwards to obtain the drugs causing C10 (allergic conditions), i.e.,
K#2. Now the drugs which are in K#4 and not in K#2 are the CVAs which
do not cause allergic conditions.


5     SPARQL Queries over Pattern Concept Lattice

For a user with basic knowledge of SPARQL, this navigation space
can easily be accessed by posing simple queries. The retrieved sub-lattice gives additional
information related to the query (detailed below). The obtained pattern concept
lattice keeps the RDF Schema materialized, meaning that no reasoning is required at
query time. It also allows for accessing several resources from
one platform. This navigation space allows queries which are subject-, predicate-
or object-bound. In this section we describe simple as well as complex SPARQL
queries that can be posed over this navigation space. Simple queries refer to
queries with simple Basic Graph Patterns, such as a single triple pattern in the
WHERE clause of the SPARQL query; complex queries, on the other hand, allow
joins and negations.


5.1   Simple Queries

Simple queries include subject-, predicate- or object-bound queries. Subject-bound
queries refer to queries where the value of the subject in an RDF triple
⟨s, p, o⟩ is known while the values of the predicate, the object and the classes of objects
are to be retrieved. Let a SPARQL query be SELECT ?p ?o WHERE {si ?p ?o},
where si is a known subject, and let (A, d) be a pattern concept such that si ∈ A.
The BGP in this query is (si ?p ?o); this BGP is matched against the RDF
graph present in the pattern concept lattice. The answer to the query will be
the object concept {si}□ = {di}, where di ∈ d and d is the intent of the
concept (A, d). Note that this object concept is already computed while building
the pattern concept lattice. In other words, all the predicate-object pairs {di}
connected to the subject si in the RDF graph G are returned as a result. For
example, if the user wants to know the complete information about the drug s1, i.e.,
its side effects and its category, the query is SELECT ?p ?o WHERE {s1 ?p ?o}. It
searches for the object concept in the pattern concept lattice given in Figure 3. For
instance, {s1}□ = ((p1 : {C1, C2, C3}), (p2 : {D2})) (K#1) and the answer would
be ((p1 : {C1, C2, C3}), (p2 : {D2})). The first part of the answer, (p1 : {C1, C2, C3}),
belongs to SIDER while (p2 : {D2}) belongs to DrugBank. However, because of the
strong lattice structure, a sub-lattice is retrieved by navigating to more general
concepts (upward navigation) present in the pattern concept lattice, i.e., all the
concepts connected to K#1: K#2, K#4, K#5, K#9, K#10, K#13, K#8.
Here, K#1 is the focus concept because it keeps the exact answer to the posed
SPARQL query, and the rest of the concepts contain additional information.
This way the user not only gets the desired answer but also some additional
information about the drug, i.e., the classification of side effects from MedDRA,
the features it shares with other drugs, and with which drugs it shares certain
features. Object-bound queries are answered in the same way as described above,
by retrieving the attribute concept and the linked concepts. However, in this case
the additional information is retrieved by navigating downwards from the
focus concept, because for an object-bound query the objects are contained in
the intents of the pattern concepts.
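
Assuming the concepts of Table 4 are available as plain data, such a subject-bound lookup can be sketched as follows (only a few concepts are listed; the remaining ones are omitted for brevity):

# Extents and intents copied from Table 4 (intents abbreviated as strings).
concepts = {
    "K#1":  ({"s1"}, "(p1:{C1,C2,C3}), (p2:{D2})"),
    "K#2":  ({"s1", "s2", "s5"}, "(p1:{C1,C10}), (p2:{D5})"),
    "K#13": ({"s1", "s5"}, "(p1:{C1,C3}), (p2:{D2})"),
    # ... remaining concepts of Table 4 omitted
}

def object_concept(subject):
    # Focus concept of a subject-bound query: the concept with the smallest
    # extent containing the subject; its intent is the answer.
    containing = [(k, v) for k, v in concepts.items() if subject in v[0]]
    return min(containing, key=lambda kv: len(kv[1][0]))

print(object_concept("s1"))
# ('K#1', ({'s1'}, '(p1:{C1,C2,C3}), (p2:{D2})')); the more general concepts
# containing s1 (K#2, K#13, ...) form the retrieved sub-lattice.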


5.2   Queries with Joins

Now let us consider a slightly more complex query which includes a join and
classes of resources to be retrieved. Suppose it is specified that the drugs to be re-
trieved should be Cardiovascular Agents causing some Allergic Condition. Then
the SPARQL query with a subject-subject join is as follows:

SELECT ?drug WHERE {
?drug p2 D5 .
?drug p1 C10 }

     For obtaining the answer, the BGP in the above query is matched against
the concept which contains both predicates p1 and p2; along with these predicates
it keeps the class of objects CardiovascularAgents (D5) connected to
predicate p2 as well as AllergicConditions (C10) connected to predicate
p1. The answer will be {s1, s2, s5}, i.e., K#2, which is the focus concept.
More specific information regarding these classes of drug
categories and side effects is returned by navigating downwards in
the concept lattice. Finally, a sub-lattice containing K#2, K#3, K#13, K#1 is
returned as an answer to the above query. In this case, the user obtains additional
information regarding more specific categories related to Cardiovascular
Agents causing Allergic Conditions: according to K#13 the drugs {s1, s5} are
D2 and according to K#3, {s2} is D3. Similarly, specific allergic conditions are also retrieved.
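
Under the assumption that the lattice of Table 4 is available as data, locating the focus concept of this join query can be sketched as an intent-matching lookup (only the relevant concepts are shown):

# Extents and structured intents copied from Table 4.
concepts = {
    "K#2":  ({"s1", "s2", "s5"}, {"p1": {"C1", "C10"}, "p2": {"D5"}}),
    "K#4":  ({"s1", "s2", "s3", "s5"}, {"p1": {"C11"}, "p2": {"D5"}}),
    "K#13": ({"s1", "s5"}, {"p1": {"C1", "C3"}, "p2": {"D2"}}),
    # ... remaining concepts of Table 4 omitted
}
query = {"p1": "C10", "p2": "D5"}   # AllergicConditions and CardiovascularAgents

# Concepts whose intent mentions the queried class under each queried predicate;
# the most general match (largest extent) is taken as the focus concept.
matches = [(k, ext) for k, (ext, intent) in concepts.items()
           if all(cls in intent.get(p, set()) for p, cls in query.items())]
print(max(matches, key=lambda kv: len(kv[1])))
# ('K#2', {'s1', 's2', 's5'}), i.e. the bindings for ?drug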


5.3   Queries with Filtering Criteria

Now let us move to the scenario where the doctor is looking for a drug re-
placement, based on side effects, for a patient with heart conditions. For
replacing the drugs, the system can be queried by locating all the CVAs (this
information comes from DrugBank and MeSH in the current navigation space) and
then excluding all the CVAs causing Allergic Conditions. The related SPARQL
query can be given as follows:

SELECT ?drug WHERE {
?drug p2 D5 .
?drug p1 ?se
FILTER (?se != C10 ) }
     The condition is given in the FILTER clause because SPARQL does not directly
support a negation operation. All the drugs which are CVAs are first
retrieved, and then those causing allergic conditions are filtered out.
The same holds for the lattice operation: concept K#4 is
selected because it contains all the cardiovascular agents, then the more spe-
cific concept containing the CVAs causing allergic conditions, i.e., K#2, is obtained,
and the minus operator is applied to perform the desired filtering. The answer
obtained is s3.
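
On the lattice side, this filtering amounts to a set difference between two extents taken from Table 4, as the following small sketch shows:

# Extents copied from Table 4.
extent_K4 = {"s1", "s2", "s3", "s5"}   # all cardiovascular agents (p2 : {D5})
extent_K2 = {"s1", "s2", "s5"}         # CVAs causing allergic conditions (p1 : {C10})
print(extent_K4 - extent_K2)           # {'s3'}, the drug kept by the FILTER query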
     In a sense, the navigation space follows the query mechanism provided by
Google where a query is posed by the user and this query is mapped to a pre-
existing cluster of documents. In the same way, when a query is posed by the
user, our query mechanism maps the query to a cluster of RDF triples, where
each element in an RDF triple keeps the URI of a web-page.


6      Experimentation

The current approach was applied to biomedical data. This section details the
experimental evaluation for the creation of the navigation space. The proposed
algorithm was coded in C++ and the experiments were performed using 3 GB of
RAM on Ubuntu 12.04.


6.1     Drug Search

The datasets containing information about drugs are DrugBank, SIDER, Med-
DRA and MeSH, as described in Section 3.1. DrugBank keeps detailed informa-
tion about each of the drugs, their categories and the proteins they target. The
other database is SIDER, which keeps the side effects of drugs as listed on the
drug packaging. Access to these two datasets is provided by the
University of Mannheim through two different endpoints13,14.
    The schema associated with the side effects of a drug is available on BioPor-
tal [12], a web portal that provides access to a repository of biomedical
ontologies. BioPortal is developed by the National Center for Biomedical Ontology
(NCBO) [10]. The Medical Dictionary for Regulatory Activities (MedDRA) Ter-
minology is the international medical terminology; in this experiment we
enrich side effects with schema-level information using the MedDRA terminol-
ogy. For drug categories, the MeSH vocabulary was taken into
account. MeSH (Medical Subject Headings) is a controlled vocabulary thesaurus.
The drug categories from DrugBank are enriched with tree numbers from
MeSH; the tree numbers arrange the MeSH terms in a hier-
archical manner known as the MeSH Tree Structures.
13 http://wifo5-04.informatik.uni-mannheim.de/drugbank/snorql/
14 http://wifo5-04.informatik.uni-mannheim.de/sider/snorql/

In the current experiment we used the MeSH vocabulary already available in RDF form in Bio2RDF [1, 5],
which makes public bioinformatics databases such as KEGG, MeSH,
HGNC, etc. available in RDF format.
   During this experiment, two subsets of the dataset were considered, belonging
to two major classes of drugs, i.e., Cardiovascular Agents (CVA) and
Central Nervous System (CNS) agents.

6.2   Evaluation
In the following, we study the scalability of Semantic Pattern Structures over
a large dataset. Table 5 gives the statistics of the data. The pattern concept lattices
over both chosen data sources were built in 0-25 seconds for a maximum of
31098 triples. Figure 4b shows the variation in the size of the navigation space
for both datasets. The navigation space contains a maximum of around 500000
clusters of triples, which were generated in 25 seconds. However, there are several
ways to reduce this number of concepts. Filtering based on the depth of the
classes considered in the taxonomy reduces the number of
clusters while generating the concept lattice and hence decreases the
runtime of creating the navigation space; most of these very general classes are
not very interesting for the domain expert. Moreover, several measures
such as support, stability and lift allow post-filtering of the navigation
space.


               Datasets        No. of Triples No. of Subjects No. of Objects Runtime
         Cardiovascular Agents     31098            145             927       0-22 sec
        Central Nervous System     22680            105            1050       0-25 sec
                   Table 5: Statistics of two datasets and navigation space.




                     Fig. 4: Size of the Navigation Space for each dataset.

7   Related work
In [6], the author focuses on allowing conceptual navigation in RDF graphs,
where each concept is accessed through SPARQL-like queries. However, in our
case several RDF graphs are considered and we use the already existing, well-
established and well-documented query language SPARQL. Moreover, [3] in-
troduces ontological pattern structures for enriching raw data with EL ontolo-
gies. But both approaches consider only one resource at a time, hence they do not
target the problem of decentralization. By contrast, our approach provides a
navigation space over RDF graphs as well as schema-level information from sev-
eral resources, allowing the user to access information from one platform. In [4], the
authors introduce a new clause Categorize By which clusters SPARQL query
answers w.r.t. background knowledge present as ontologies. Like our approach,
only a taxonomy is used for the enrichment; however, unlike our approach, it clusters
the answers only after obtaining them. By contrast, our approach provides a
classification of the RDF triples as well as of the RDF Schema beforehand. After-
wards, SPARQL queries are mapped to the existing clusters and the answer is
shown to the user. In such a case, the query response time is faster than with the
Categorize By clause. Moreover, as [4] provides clustering over answers only, it
lacks the capability to provide the user with additional information.

8   Conclusion and Future Work
This paper proposes a new approach for navigating semantic web data and exploits
the capabilities of FCA to deal with RDF data. It provides navigation
and search capabilities over RDF triples as well as over RDF Schema distributed
over several resources. The paper proposes a new similarity measure for pattern
structures, termed Semantic Pattern Structures, to deal with RDF data and RDF
Schema simultaneously. The pattern concepts in the concept lattice are
considered as clusters of RDF triples which can then be navigated and queried
by the user.

References
 1. François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and
    Jean Morissette. Bio2rdf: Towards a mashup to build bioinformatics knowledge
    systems. Journal of Biomedical Informatics, 41(5):706–716, 2008.
 2. Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data - the story so far.
    Int. J. Semantic Web Inf. Syst., 5(3):1–22, 2009.
 3. Adrien Coulet, Florent Domenach, Mehdi Kaytoue, and Amedeo Napoli. Using
    pattern structures for analyzing ontology-based annotations of biomedical data.
    In 11th International Conference on Formal Concept Analysis, 2013.
 4. Claudia d’Amato, Nicola Fanizzi, and Agnieszka Lawrynowicz. Categorize by:
    Deductive aggregation of semantic web query results. In Lora Aroyo, Grigoris An-
    toniou, Eero Hyvönen, Annette ten Teije, Heiner Stuckenschmidt, Liliana Cabral,
    and Tania Tudorache, editors, ESWC (1), volume 6088 of Lecture Notes in Com-
    puter Science, pages 91–105. Springer, 2010.
 5. Michel Dumontier, Alison Callahan, Jose Cruz-Toledo, Peter Ansell, Vincent
    Emonet, François Belleau, and Arnaud Droit. Bio2rdf release 3: A larger, more
    connected network of linked data for the life sciences. In Posters & Demonstrations
    Track ISWC., 2014.
 6. Sébastien Ferré. Conceptual navigation in RDF graphs with sparql-like queries.
    In Formal Concept Analysis, 8th International Conference, ICFCA 2010, Agadir,
    Morocco, March 15-18, 2010. Proceedings, pages 193–208, 2010.
 7. Bernhard Ganter and Sergei O. Kuznetsov. Pattern structures and their projec-
    tions. In Harry S. Delugach and Gerd Stumme, editors, ICCS, volume 2120 of
    Lecture Notes in Computer Science, pages 129–142. Springer, 2001.
 8. Bernhard Ganter and Rudolf Wille. Formal Concept Analysis: Mathematical Foun-
    dations. Springer, Berlin/Heidelberg, 1999.
 9. Ralf Küsters and Ralf Molitor. Computing least common subsumers in ALEN. In
    Proceedings of the Seventeenth International Joint Conference on Artificial Intel-
    ligence, IJCAI, pages 219–224, 2001.
10. Mark A. Musen, Natalya Fridman Noy, Nigam H. Shah, Patricia L. Whetzel,
    Christopher G. Chute, Margaret-Anne D. Storey, and Barry Smith. The national
    center for biomedical ontology. JAMIA, 19(2):190–195, 2012.
11. York Sure, Stephan Bloehdorn, Peter Haase, Jens Hartmann, and Daniel Oberle.
    The swrc ontology - semantic web for research communities. In EPIA, volume 3808
    of Lecture Notes in Computer Science, pages 218–231. Springer, 2005.
12. Patricia L. Whetzel, Natalya Fridman Noy, Nigam H. Shah, Paul R. Alexander,
    Csongor Nyulas, Tania Tudorache, and Mark A. Musen. Bioportal: enhanced func-
    tionality via new web services from the national center for biomedical ontology to
    access and use ontologies in software applications. Nucleic Acids Research, 39(Web-
    Server-Issue):541–545, 2011.
         Subdirect Decomposition of Contexts into
              Subdirectly Irreducible Factors

      Jean-François Viaud1 , Karell Bertet1 , Christophe Demko1 , Rokia Missaoui2
                    1
                        Laboratory L3i, University of La Rochelle, France
                           {jviaud, kbertet, cdemko}@univ-lr.fr
                         2
                           University of Québec in Outaouais, Canada
                                   rokia.missaoui@uqo.ca



          Abstract. The size of a concept lattice may increase exponentially with
          the size of the context. When the number of nodes is too large, it becomes
          very difficult to generate and study such a concept lattice. A way to avoid
          this problem is to break down the lattice into small parts. In the subdirect
          decomposition, the small parts are factor lattices which are meaningful
          in the Formal Concept Analysis (FCA) setting.
          In this paper a walkthrough from a finite reduced context to its subdi-
          rect decomposition into subdirectly irreducible subcontexts and factors
          is given. The decomposition can be reached using three different points
          of view, namely factor lattices, arrow relations and compatible subcon-
          texts. The approach is mainly algebraic since it is based on abstract
          lattice theory, except for the last point which is inherited from FCA. We
          also propose a polynomial algorithm to generate the decomposition of
          an initial context into subcontexts. Such a procedure can be extended to
          conduct an interactive exploration and mining of large contexts, includ-
          ing the generation of few concepts and their neighborhood.

          Keywords: concept lattice, congruence relation, factor lattice, arrow
          relation, arrow closed subcontext, compatible subcontext


  1     Introduction

  During the last decade, increased computation capabilities have promoted Formal
  Concept Analysis (FCA) and new methods based on concept lattices. Though
  they are exponential in space/time in the worst case, concept lattices of a reason-
  able size enable an intuitive representation of data organized by a context that
  links objects to attributes through a binary relation. Methods based on concept
  lattices have been developed in various domains such as knowledge discovery
  and representation, database management and information retrieval where some
  relevant concepts, i.e. possible correspondences between objects and attributes
  are considered either as classifiers, clusters or representative object/attribute
  subsets.
      With the increasing size of data, a set of methods have been proposed in
order to either generate a subset (rather than the whole set) of concepts and
their neighborhood (e.g. successors and predecessors) in an online and interac-
tive way [8, 20] or better display lattices using nested line diagrams [13]. Such
approaches become inefficient when contexts are huge. However, the main idea of
lattice/context decomposition into smaller ones is still relevant when the classifi-
cation property of the initial lattice is maintained. Many lattice decompositions
have been defined and studied, both from algebraic [6, 16] and FCA points of
view [13, 12]. We can cite Unique Factorisation Theorem [16], matrix decompo-
sition [2], Atlas decomposition [13], subtensorial decomposition [13], doubling
convex construction [5, 17, 14, 3] and subdirect decomposition. The latter was
widely studied years ago in the field of universal algebra [6, 11],
and also in FCA [21–24, 12]. To the best of our knowledge, there are no recent
developments or novel algorithms for the subdirect decomposition of contexts.
    In this paper we investigate the subdirect decomposition of a concept lattice
as a first step towards an interactive exploration and mining of large contexts.
The subdirect decomposition of a lattice L into factor lattices (Li)i∈{1,...,n}, de-
noted by L ↪ L1 × · · · × Ln, is defined by two properties (see important results in
[13]): (i) L is a sublattice of the direct product L1 × · · · × Ln, and (ii) each projec-
tion of L onto a factor is surjective. The first property establishes that each factor
lattice is the concept lattice of an arrow-closed subcontext, i.e. closed accord-
ing to the arrow relation between objects and attributes. This means that the
decomposition can be obtained by computing specific subcontexts. The second
property states that there is an equivalence between arrow-closed subcontexts
and congruence relations of L, i.e., an equivalence relation whose equivalence
classes form a lattice with elements closed by the meet and join operations. This
means that the concepts of L can be retrieved from the factor lattices, and the
classification property of L is maintained since each equivalence relation forms
a partition of the elements. The last result establishes an equivalence between
arrow-closed subcontexts and compatible subcontexts, i.e. subcontexts such that
each concept corresponds to a concept of the initial lattice. This result gives a
way to compute the morphism from L into the direct product, and thus to re-
trieve the concepts of L from the factor lattices. In this paper, we deduce from
these results strong links between the following notions that, as far as we know,
have not yet been used together:

 – Factors of a subdirect decomposition,
 – Congruence relations,
 – Arrow-closed subcontexts and
 – Compatible subcontexts.

   As suggested in [13], the contexts of the factors of a particular subdirect
decomposition, namely the subdirectly irreducible subcontexts, can be obtained
by a polynomial processing of each row/object of the initial context. Therefore,
the subdirect decomposition of a lattice can be extended to a subdirect decom-
position of its reduced context into subdirectly irreducible subcontexts.
   In this paper, we propose a subdirect and polynomial decomposition of a
context into subcontexts by extending the subdirect decomposition of a lattice
into factor lattices. This decomposition leads to data storage savings for large
contexts. Indeed, the generation of the whole set of factor lattices can be avoided
by providing an interactive generation of a few (but not all) concepts and their
neighborhood from large contexts. Moreover, a focus on a specific factor lattice
can be proposed to the user by generating, partially or entirely, the concept
lattice and/or a basis of implications.
    There are at least two reasons for studying this case of pattern manage-
ment. The first one comes from the fact that users tend to be overwhelmed
by the knowledge extracted from data, even when the input is relatively small.
The second reason is that the community of FCA has made progress in lattice
construction and exploration, and hence existing solutions can be adapted and
enriched to only target useful patterns (i.e. pieces of knowledge).
    This paper is organized as follows. Section 2 introduces the subdirect de-
composition and the three different points of view, namely factor lattices, arrow
relations and compatible subcontexts. Section 3 contains the main contribution
of this paper about the subdirect decomposition and the proposed algorithms.
A preliminary empirical study is presented in Section 4 while Section 5 presents
future work.


2     Structural framework
Throughout this paper all sets (and thus lattices) are considered to be finite.

2.1    Lattices and Formal Concept Analysis
Algebraic lattice Let us first recall that a lattice (L, ≤) is an ordered set in
which every pair (x, y) of elements has a least upper bound, called join x ∨ y,
and a greatest lower bound, called meet x ∧ y. As we are only considering finite
structures, every subset A ⊂ L has a join and a meet (i.e., finite lattices are
complete).

Concept or Galois Lattice A (formal) context (O, A, R) is defined by a set
O of objects, a set A of attributes, and a binary relation R ⊂ O × A between O
and A. Two operators are derived:
 – for each subset X ⊂ O, we define X′ = {m ∈ A : j R m ∀j ∈ X} and, dually,
 – for each subset Y ⊂ A, we define Y′ = {j ∈ O : j R m ∀m ∈ Y }.
    A (formal) concept represents a maximal objects-attributes correspondence
by a pair (X, Y) such that X′ = Y and Y′ = X. The sets X and Y are respec-
tively called the extent and the intent of the concept. The set of concepts derived from
a context is ordered as follows:
                   (X1, Y1) ≤ (X2, Y2) ⇐⇒ X1 ⊆ X2 ⇐⇒ Y2 ⊆ Y1
   The whole set of formal concepts together with this order relation forms a
complete lattice, called the concept lattice of the context (O, A, R).
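
A minimal Python sketch of the two derivation operators on a small hypothetical context (not the context of Figure 2) is given below:

O = {"o1", "o2", "o3"}
A = {"a", "b", "c"}
R = {("o1", "a"), ("o1", "b"), ("o2", "b"), ("o2", "c"), ("o3", "b")}

def prime_objects(X):
    # X' : attributes shared by all objects of X
    return {m for m in A if all((o, m) in R for o in X)}

def prime_attributes(Y):
    # Y' : objects having all attributes of Y
    return {o for o in O if all((o, m) in R for m in Y)}

X = {"o1", "o2"}
print(prime_objects(X))                    # {'b'}
print(prime_attributes(prime_objects(X)))  # {'o1', 'o2', 'o3'}: X'' is the extent of a concept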

    Different formal contexts can provide isomorphic concept lattices, and there
exists a unique one, named the reduced context, defined by the two sets O and
A of the smallest size.
    This particular context is introduced by means of special concepts or elements
of the lattice L, namely irreducible elements.
    An element j ∈ L is join-irreducible if it is not a least upper bound of
a subset not containing it. The set of join-irreducible elements is denoted JL.
Meet-irreducible elements are defined dually and their set is ML. As a direct
consequence, an element j ∈ L is join-irreducible if and only if it has exactly one
immediate predecessor, denoted j−. Dually, an element m ∈ L is meet-irreducible
if and only if it has exactly one immediate successor, denoted m+.
    In Figure 1, join-irreducible nodes are labelled with a number and meet-
irreducible nodes are labelled with a letter.




                    Fig. 1. A lattice with its irreducible nodes




Fundamental Bijection A fundamental result [1] establishes that any lattice
(L, ≤) is isomorphic to the concept lattice of the context (JL , ML , ≤), where JL
and ML are the join and meet irreducible concepts of L, respectively. Moreover,
this context is a reduced one.
    As a direct consequence, there is a bijection between lattices and reduced con-
texts where objects of the context are associated with join-irreducible concepts
of the lattice, and attributes are associated with meet-irreducible concepts.
    Figure 2 shows the reduced context of the lattice in Figure 1.

                                      b c d f g j
                                   2  × × × × ×
                                   3  ×   ×   × ×
                                   5  × × ×
                                   6  × ×   ×
                                   9    ×   ×

                  Fig. 2. The reduced context of the lattice in Figure 1




2.2    Compatible and Arrow-closed Subcontexts

This section is dedicated to the equivalence between compatible and arrow-closed
subcontexts.


Compatible subcontexts A subcontext of a formal context (O, A, R) is a triple
(J, M, R ∩ J × M ) such that J ⊂ O and M ⊂ A. A subcontext (J, M, R ∩ J × M )
of (O, A, R) is compatible if for each concept (H, N ) of (O, A, R), (J ∩ H, M ∩ N )
is a concept of (J, M, R ∩ J × M ).


Arrow relations The arrow-closed subcontexts involved in the equivalence
are based on the arrow relations between join- and meet-irreducible concepts of
a lattice. Consider the reduced context (JL, ML, ≤) of a lattice (L, ≤). Arrow
relations [4, 15] form a partition of the relation ≰ (defined by not having x ≤ y)
by considering the immediate predecessor j− of a join-irreducible j, and the
unique immediate successor m+ of a meet-irreducible m:

 – j ↕ m if j ≰ m, j ≤ m+ and j− ≤ m.
 – j ↑ m if j ≰ m, j ≤ m+ and j− ≰ m.
 – j ↓ m if j ≰ m, j ≰ m+ and j− ≤ m.
 – j ◦ m if j ≰ m, j ≰ m+ and j− ≰ m.

    In Figure 3, the reduced context of Figure 2 is enriched with the four relations
↕, ↑, ↓, and ◦ in the empty cells, which all correspond to the case where j ≰ m:


                                      b c d f g j
                                   2  × × × × × ↕
                                   3  × ↕ × ↓ × ×
                                   5  × × × ↕ ↕ ◦
                                   6  × × ↕ × ↓ ◦
                                   9  ↕ × ◦ × ◦ ◦

                                Fig. 3. Arrow relations

    As an illustration, let j = 5 and m = f be a join-irreducible and a meet-
irreducible node respectively (see Figure 1). Then, j− = 2 and m+ = c. The
relation 5 ↕ f holds since 5 ≰ f, 5 ≤ c and 2 ≤ f.
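
The arrows can also be computed directly from the reduced context, without building the lattice, using the standard context-based characterization of the arrow relations (which, for a reduced context, agrees with the definition above). The following sketch recomputes the table of Figure 3:

OBJECTS = ["2", "3", "5", "6", "9"]
ATTRS = ["b", "c", "d", "f", "g", "j"]
R = {("2", a) for a in "bcdfg"} | {("3", a) for a in "bdgj"} | \
    {("5", a) for a in "bcd"} | {("6", a) for a in "bcf"} | {("9", a) for a in "cf"}

intent = {j: {m for m in ATTRS if (j, m) in R} for j in OBJECTS}
extent = {m: {j for j in OBJECTS if (j, m) in R} for m in ATTRS}

def arrow(j, m):
    if (j, m) in R:
        return "x"
    # up: every attribute whose extent strictly contains extent(m) is related to j
    up = all((j, n) in R for n in ATTRS if extent[m] < extent[n])
    # down: every object whose intent strictly contains intent(j) is related to m
    down = all((h, m) in R for h in OBJECTS if intent[j] < intent[h])
    return {(True, True): "updown", (True, False): "up",
            (False, True): "down", (False, False): "o"}[(up, down)]

for j in OBJECTS:
    print(j, [arrow(j, m) for m in ATTRS])
# Row "5" prints ['x', 'x', 'x', 'updown', 'updown', 'o'], matching Figure 3.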


Arrow-closed subcontext A subcontext (J, M, R ∩ J × M ) of a context
(O, A, R) is an arrow-closed subcontext when the following conditions are met:

 – If j ↑ m and j ∈ J then m ∈ M
 – If j ↓ m and m ∈ M then j ∈ J

    As an example, the first subcontext of Figure 4 is an arrow-closed subcontext
of the reduced context of Figure 3 whereas the second one is not, due to the
down-arrow 6 ↓ g.


                 c d f g                          c d f g
              3    ×   ×                       3    ×   ×
              5  × ×                           5  × ×
              6  ×   ×

 Fig. 4. Arrow-closed and non-arrow-closed subcontexts of the context in Figure 3




Equivalence theorem First let us introduce the first equivalence we need in
this paper, whose proof can be found in [13]:

Theorem 1. Let (J, M, R ∩ J × M ) be a subcontext of (O, A, R). The following
propositions are equivalent:

 – The subcontext (J, M, R ∩ J × M ) is a compatible one.
 – The subcontext (J, M, R ∩ J × M ) is an arrow-closed one.


2.3   Congruence Relations and Factor Lattices

In this section, we introduce the equivalence between congruence relations and
arrow-closed subcontexts.


Quotient An equivalence relation is a binary relation R over a set E which is
reflexive, symmetric, and transitive. An equivalence class of x ∈ E is:

                              xR = {y ∈ E | xRy}

The set of equivalence classes, called the quotient set E/R, is:

                              E/R = {xR | x ∈ E}

Factor lattice A congruence relation Θ on a lattice L is an equivalence relation
such that:

            x1 Θ y1 and x2 Θ y2 =⇒ (x1 ∧ x2) Θ (y1 ∧ y2) and (x1 ∨ x2) Θ (y1 ∨ y2)

The quotient L/Θ verifies the following statement:

                      xΘ ≤ yΘ ⇐⇒ xΘ(x ∧ y) ⇐⇒ (x ∨ y)Θy

With such an order, L/Θ is a lattice, called the factor lattice. A standard theorem
from algebra, whose proof is omitted, states that:

Theorem 2. The projection L → L/Θ is a surjective (onto) lattice morphism.


The second equivalence theorem We are now able to formulate the second
equivalence whose proof can be found in [13]:

Theorem 3. Given a lattice L, the set of congruence relations on L corresponds
bijectively with the set of arrow-closed subcontexts of the reduced context of L.

   Congruence relations will be computed with this theorem. However, other
algorithms exist [9, 10].


2.4    Subdirect decompositions

In this section, we introduce the equivalence between subdirect decompositions
and sets of arrow-closed subcontexts.


Subdirect product

Definition 1. A subdirect product is a sublattice of a direct product L1 ×· · ·×Ln
of lattices Li , i ∈ {1, . . . , n} such that each projection onto a factor is surjective.
The lattices Li , i ∈ {1, . . . , n} are the factor lattices. A subdirect decomposition
of a lattice L is an isomorphism between L and a subdirect product which can be
denoted as:
                                   L ,→ L1 × · · · × Ln  Li


The third equivalence theorem The third and most important equivalence,
whose proof can be found in [13], makes a connection with sets of arrow-closed
subcontexts when they cover the initial context:

Proposition 1. Given a reduced context (O, A, R), then the subdirect decom-
positions of its concept lattice L correspond bijectively to the families of arrow-
closed subcontexts (Jj , Mj , R ∩ Jj × Mj ) with O = ∪Jj and A = ∪Mj .

3     Our contribution
3.1   Main Result
From the three previous equivalences found in [13], we deduce the following one:

Corollary 1. Given a lattice L and its reduced context (O, A, R), we have an
equivalence between:

1. The set of arrow-closed subcontexts of (O, A, R),
2. The set of compatible subcontexts of (O, A, R),
3. The set of congruence relations of L and their factor lattices.

Corollary 2. Given a lattice L and its reduced context (O, A, R), we have an
equivalence between:

1. The families of arrow-closed subcontexts of (O, A, R) covering O and A,
2. The families of compatible subcontexts of (O, A, R) covering O and A,
3. The families (θi )i∈I of congruence relations of L such that ∩i∈I θi = ∆ with
   x∆y ⇐⇒ x = y.
4. The set of subdirect decompositions of L and their factor lattices.

   In the following, we exploit these four notions that, to the best of our knowl-
edge, have not been put together yet.

1. The subdirect decomposition ensures that L is a sublattice of the factor
   lattice product. Moreover, each projection from L to a factor lattice is sur-
   jective.
2. The congruence relations of L indicate that factor lattices correspond to
   their quotient lattices, and thus preserve partitions via equivalence classes.
3. The compatible subcontexts give a way to compute the morphism from L
   onto its factors.
4. Arrow-closed subcontexts enable the computation of the reduced context of
   the factor lattices.

   In the following we present the generation of a particular subdirect decom-
position and show a possible usage of factor lattices.

3.2   Generation of Subdirectly Irreducible Factors
In this section, we consider subdirect decompositions of a lattice L with its re-
duced context (O, A, R) as input. From Corollary 2, a subdirect decomposition
of a lattice L can be obtained by computing a set of arrow-closed subcontexts of
(O, A, R) that have to cover O and A. There are many sets of arrow-closed sub-
contexts and thus many subdirect decompositions. In particular, the decomposi-
tion from a lattice L into L itself is a subdirect decomposition, corresponding to
the whole subcontext (O, A, R) which is clearly arrow-closed. A subdirect decom-
position algorithm has been proposed in [12]. However, all congruence relations
are computed and then only pairs of relations are formed to get a decomposi-
tion. As a consequence, potentially multiple decompositions are produced with
necessarily two factors.
    In this article, we focus on the subdirect decomposition of a context into a
possibly large number of small factors, i.e. factors that cannot be subdirectly
decomposed. A factor lattice L is subdirectly irreducible when any subdirect
decomposition of L leads to L itself. A nice characterization of subdirectly irre-
ducible lattices can be found in [13]:

Proposition 2. A lattice L is subdirectly irreducible if and only if its reduced
context is one-generated.

    A context (O, A, R) is one-generated if it can be obtained by arrow-closing
a subcontext containing only one object j ∈ J. Thus (O, A, R) is the smallest
arrow-closed subcontext containing j ∈ J.
   Therefore, we deduce the following result:

Proposition 3. Let L be a lattice. From L, we can deduce a product lattice
L1 × ... × Ln where each lattice Li is:

 – the concept lattice of a one-generated subcontext,
 – subdirectly irreducible,
 – a factor lattice of the subdirectly irreducible decomposition.

    From this result, we propose an algorithm (Algorithm 1) to compute in poly-
nomial time the contexts of the factor lattices L1, . . . , Ln of a subdirectly
irreducible decomposition, with a reduced context (O, A, R) as input. The one-
generated subcontexts for each j ∈ J are obtained by arrow-closing (Algorithm
2). The subdirectly irreducible decomposition of L can then be obtained by
computing the concept lattices of these subcontexts.
    One can notice that the closure computed from join-irreducible concepts can
also be calculated from meet-irreducible concepts.
 Algorithm 1: Subdirect Decomposition
   Input: A context (O, A, R)
   Output: List L of the contexts (Jj, Mj, Rj) of the subdirectly irreducible
            factor lattices
 1 L ← ∅;
 2 forall the j ∈ O do
 3    Compute (Jj, Mj, Rj) = Arrow Closure((j, ∅, ∅), (O, A, R)), the
      one-generated subcontext spanned by j;
 4    if L does not contain any subcontext that covers (Jj, Mj, Rj) then
 5        add (Jj, Mj, Rj) to L;
 6        if L contains a subcontext covered by (Jj, Mj, Rj) then
 7            delete it from L;
 8 return L;

 Algorithm 2: Arrow Closure
   Input: A subcontext (J̃, M̃, R̃) of a context (J, M, R)
   Output: Arrow-closure of (J̃, M̃, R̃)
 1 Jc = J̃; Mc = M̃;
 2 predJ = 0; predM = 0;
 3 while predM < card(Mc) or predJ < card(Jc) do
 4    predJ = card(Jc);
 5    predM = card(Mc);
 6    forall the j ∈ Jc do
 7        add to Mc all m ∈ M such that j ↑ m;
 8    forall the m ∈ Mc do
 9        add to Jc all j ∈ J such that j ↓ m;
10 return (Jc, Mc, R ∩ Jc × Mc)



            j   Arrow Closure input (J̃, M̃, R̃)   Output Jc     Output Mc       Contained in L
            2   (2, ∅, ∅)                          {2}           {j}                   ×
            3   (3, ∅, ∅)                          {3}           {c}
            5   (5, ∅, ∅)                          {3, 5, 6}     {c, d, f, g}          ×
            6   (6, ∅, ∅)                          {6}           {d}
            9   (9, ∅, ∅)                          {9}           {b}                   ×

        Fig. 5. Iterations of Algorithm 1 for the reduced context in Figure 2



    Consider the reduced context in Figure 2. Figure 5 describes each iteration
of Algorithm 1: for each value of j, it gives the input and output of Algorithm 2,
and indicates the three one-generated subcontexts that belong to L at the end of
the process. Therefore we get three factor lattices (see Figure 6).
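
The iterations of Figure 5 can be reproduced with a short Python sketch of Algorithms 1 and 2. Here the arrows are simply copied by hand from Figure 3 rather than recomputed, and the double arrows are used both as up-arrows and as down-arrows (they satisfy both conditions), which is what yields the closures listed in Figure 5:

# Arrows read off Figure 3: the double arrows, plus the two pure down-arrows.
DOUBLE = {("2", "j"), ("3", "c"), ("5", "f"), ("5", "g"), ("6", "d"), ("9", "b")}
UP = set(DOUBLE)
DOWN = set(DOUBLE) | {("3", "f"), ("6", "g")}

def arrow_closure(J0, M0):
    # Algorithm 2: smallest arrow-closed subcontext containing (J0, M0).
    Jc, Mc = set(J0), set(M0)
    changed = True
    while changed:
        changed = False
        for (j, m) in UP:
            if j in Jc and m not in Mc:
                Mc.add(m); changed = True
        for (j, m) in DOWN:
            if m in Mc and j not in Jc:
                Jc.add(j); changed = True
    return Jc, Mc

# Algorithm 1: keep only the maximal one-generated subcontexts.
factors = []
for j in ["2", "3", "5", "6", "9"]:
    Jc, Mc = arrow_closure({j}, set())
    if not any(Jc <= J2 and Mc <= M2 for (J2, M2) in factors):
        factors = [f for f in factors if not (f[0] <= Jc and f[1] <= Mc)] + [(Jc, Mc)]
print(factors)
# the three subcontexts of Fig. 5: ({2},{j}), ({3,5,6},{c,d,f,g}), ({9},{b})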
    The first subcontext is the one on the left of Figure 4. The two other ones
are ({2}, {j}, ∅) and ({9}, {b}, ∅). The latter two subcontexts are interesting
because:

 – They show that the initial lattice has parts which are distributive. Indeed,
   these two subcontexts contain exactly one double arrow in each row and each
   column.
 – They give us a dichotomy: any concept contains either 2 or j; any concept
   contains either 9 or b.
 – In the reduced context, an arrow brings deeper knowledge than a cross.



               (a)              (b) First subcontext in           (c)
               ({2}, {j}, ∅)    Figure 4                          ({9}, {b}, ∅)

Fig. 6. The three factor lattices of the decompostion with their subcontext as caption


    The context on the left-hand side of Figure 4 is tricky to understand. For
the other ones, we have a simple relation 2 ↕ j or 9 ↕ b, which means that, for
instance, 2 and j are some kind of complement or converse of each other.
    Figure 7 shows a factor lattice and its corresponding congruence.




                         Fig. 7. Factor lattice and congruence




3.3    Onto Morphism and FCA
A subdirect decomposition of a lattice L into factor lattices L1 , . . . , Ln is relevant
since there exists an into morphism from L to the product lattice L1 × . . . × Ln .
This morphism is specified by the bijection between compatible subcontexts and
congruence relations stated by Corollary 1:
Proposition 4. Let (J, M, R ∩ J × M ) be a compatible subcontext, then the
relation ΘJ,M defined by:

       (A1 , B1 )ΘJ,M (A2 , B2 ) ⇐⇒ A1 ∩ J = A2 ∩ J ⇐⇒ B1 ∩ M = B2 ∩ M

is a congruence relation, and its factor lattice is isomorphic to the concept lattice
of the subcontext (J, M, R ∩ J × M ).
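
On the running example, this congruence can be checked with a small brute-force sketch: the concepts of the context of Figure 2 are enumerated and grouped by ΘJ,M for the compatible (arrow-closed) subcontext J = {3, 5, 6}, M = {c, d, f, g} of Figure 4:

from itertools import chain, combinations

OBJECTS = ["2", "3", "5", "6", "9"]
ATTRS = ["b", "c", "d", "f", "g", "j"]
R = {("2", a) for a in "bcdfg"} | {("3", a) for a in "bdgj"} | \
    {("5", a) for a in "bcd"} | {("6", a) for a in "bcf"} | {("9", a) for a in "cf"}

def prime_o(X):   # X': common attributes of the objects in X
    return frozenset(m for m in ATTRS if all((o, m) in R for o in X))

def prime_a(Y):   # Y': objects having all attributes of Y
    return frozenset(o for o in OBJECTS if all((o, m) in R for m in Y))

def all_concepts():
    # Brute-force enumeration: every concept is of the form (X'', X').
    subsets = chain.from_iterable(combinations(OBJECTS, k) for k in range(len(OBJECTS) + 1))
    return {(prime_a(prime_o(frozenset(X))), prime_o(frozenset(X))) for X in subsets}

J = {"3", "5", "6"}
classes = {}
for (A, B) in all_concepts():
    classes.setdefault(frozenset(A & J), []).append((A, B))   # A1 ∩ J = A2 ∩ J
print(len(all_concepts()), "concepts,", len(classes), "congruence classes")
# The number of classes equals the number of concepts of the factor lattice,
# i.e. of the concept lattice of the subcontext, as Proposition 4 states.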
    Algorithm 3 computes this morphism: each concept of L is computed as the
product of concepts in each factor, and then marked in the product lattice.
From Algorithm 3, we get the large tagged lattice product shown in Figure 8.
Obviously, this algorithm is not intended to be used in a real application with
large contexts, since the product lattice is much bigger than the original
one, while the main goal of the decomposition is to get smaller lattices. We only
use this algorithm in the empirical study.
    Nevertheless, this morphism can be extended to develop basic FCA pro-
cessing. Once the subdirectly irreducible decomposition of a reduced context
(O, A, R) into the contexts C1 , . . . , Cn is computed, an interactive exploration
and mining process can easily be considered by using the following basic tasks
and avoiding the generation of the lattice for the whole context (O, A, R):

 – Compute the smallest concept of L that contains a given subset of objects
   or attributes, and identify its neighborhood
 – Compute the smallest concept cij and its neighborhood in a subset of factors
   that contain a given collection of objects or attributes. Each factor Li is a
   specific view of data.



    Algorithm 3: Into morphism
   Input: Initial lattice L;
   Subcontexts (Jj , Mj , Rj );
   Product lattice P = L1 × . . . × Ln
   Output: Product lattice P with nodes coming from L marked.
 1 forall the c = (A, B) ∈ L do
 2    forall the (Jj , Mj , Rj ) do
 3         Compute (A ∩ Jj , B ∩ Mj );
 4      Mark, in P , the product node Πj (A ∩ Jj , B ∩ Mj );




4     Experiments
In this section, we conduct an empirical study in order to better understand the
impact of the subdirect decomposition on contexts with different density levels.
All the tests were done using the java-lattices library available at:
                    http://thegalactic.github.io/java-lattices/

                          Fig. 8. Ugly tagged lattice product
    A thousand contexts with 10 observations and 5 attributes were randomly
generated. The experiments were run three times using three density
values, namely 20%, 50% and 80%. Figure 9 presents the number of generated
lattices according to their size and their density. We observe two histograms
with a nice Gaussian shape in the first two cases, but a strange behavior in
the last case. However, we roughly observe that the higher the density is, the
bigger the lattices are. Therefore, the context density will have an impact on the
decomposition process.




      Fig. 9. Number of lattices per size: (a) density of 20%, (b) density of 50%, (c) density of 80%.


    The number of generated factors is given in Figure 10. One can observe
that this number increases with the density of the initial context since the corre-
sponding lattices have more edges. With a context density of 20%, we get 87.40%
irreducible contexts. However, with a density of 80%, only 1.50% of contexts are
irreducible and 70% of contexts have at least 4 factors in their decomposition.
Thus, lattices are more decomposable when they come from dense contexts.

                Factors Density=20% Density=50% Density=80%
                1         87.40%      70.50%       1.50%
                2          9.70%      21.40%       9.90%
                3          2.60%       6.20%      18.70%
                4          0.30%       0.50%      23.40%
                5                      1.40%      28.50%
                6                                 18.00%

Fig. 10. Proportion of the number of factors in the subdirect decomposition of the
contexts according to their density




    Let us now examine the size of factors with two different density values,
namely 20% and 50%. Figure 11 gives the number of cases (initial lattices)
according to the number of produced factors. Of course, we observe that smaller
lattices give rise to smaller factors. We can also see that in these two density
cases, the largest number is obtained for factors of size 2.




      Fig. 11. Number of cases according to the number of generated factors: (a) density of 20%, (b) density of 50%.



    The last part of the tests uses contexts with a large density of 80%
and computes the ratio between the number of nodes (resp. edges) of the initial
lattice and the number of nodes (resp. edges) of the product lattice.
    We thus get Figure 12, which shows that the higher the number of factors,
the bigger the product lattice. Consequently, the set of useless (void) nodes in
the product becomes larger as the number of factors increases.
    We have also conducted experiments with 40 observations and 15 attributes;
a hundred contexts were generated using two density values, 20% and
50%. Unfortunately, all were subdirectly irreducible.

                           Factors % of nodes % of edges
                           1         100%       100%
                           2        97.90%     97.20%
                           3        89.20%     84.60%
                           4        85.00%     77.40%
                           5        80.60%     71.80%
                           6        68.00%     57.30%

                  Fig. 12. Proportion of untagged nodes and edges



5    Conclusion and future work
In this paper, we have presented a polynomial algorithm for the decomposition
of a reduced context into subcontexts such that the concept lattice of each sub-
context is a subdirectly irreducible factor. This decomposition is a direct conse-
quence of the strong links between factors of a subdirect decomposition,
congruence relations, arrow-closed subcontexts and compatible subcontexts es-
tablished in [13].
     To further investigate the subdirect decomposition, it would be interesting to
conduct large-scale experiments on real-world data, not only to confirm or invalidate
the preliminary empirical tests but also to understand the semantics behind
the generated irreducible contexts. In particular, attributes covered by several
factors take part in different views of the data, whose semantics would be interesting
to understand. Moreover, it would be useful to allow the user to interactively
select a few factors of the decomposition by mixing our approach with the one
in [12].
     From a theoretical point of view, we think that there are strong links between
the implication basis of a quotient lattice and the one of the initial lattice. To
the best of our knowledge, this issue has never been addressed and could have
significant algorithmic impacts. However, we note that [19] tackles a similar issue
in the case of a vertical decomposition of a context into subcontexts.
     Since the empirical study in [18] shows that many real-life contexts are subdi-
rectly irreducible, we plan to (i) identify cases in which a context is necessarily
irreducible, and (ii) study, compare and combine other decompositions, in par-
ticular the Fratini congruence [7] and the reverse doubling convex construction
[5, 17, 14, 3]. Finally, the construction of a lattice from its factor lattices can be
done based on the optimization principles behind the relational join operation
in databases.

References
 1. Barbut, M., Monjardet, B. (eds.): L’ordre et la classification. Algèbre et combina-
    toire, tome II, Hachette (1970)
 2. Belohlavek, R., Vychodil, V.: Discovery of optimal factors in binary data via a
    novel method of matrix decomposition. Journal of Computer and System Sciences
    76(1), 3–20 (Feb 2010)


 3. Bertet, K., Caspard, N.: Doubling convex sets in lattices: characterizations and
    recognition algorithms. Tech. Rep. TR-LACL-2002-08, LACL (Laboratory of Al-
    gorithms, Complexity and Logic), University of Paris-Est (Paris 12) (2002)
 4. Crawley, P., Dilworth, R.: Algebraic theory of lattices. Prentice Hall, Englewood
    Cliffs (1973)
 5. Day, A.: Congruence normality: The characterization of the doubling class of con-
    vex sets. algebra universalis 31(3), 397–406 (1994)
 6. Demel, J.: Fast algorithms for finding a subdirect decomposition and interesting
    congruences of finite algebras. Kybernetika (Prague) 18(2), 121–130 (1982)
 7. Duquenne, V.: Lattice drawings and morphisms. In: Formal Concept Analysis,
    8th International Conference, ICFCA 2010, Agadir, Morocco, March 15-18, 2010.
    Proceedings. pp. 88–103 (2010)
 8. Ferré, S.: Reconciling Expressivity and Usability in Information Access from File
    Systems to the Semantic Web. Ph.D. thesis, University of Rennes 1 (2014)
 9. Freese, R.: Computing congruence lattices of finite lattices. Proceedings of the
    American Mathematical Society 125(12), 3457–3463 (1997)
10. Freese, R.: Algorithms in finite, finitely presented and free lattices. Preprint, July
    22, 159–178 (1999)
11. Freese, R.: Computing congruences efficiently. Algebra universalis 59(3-4), 337–343
    (2008)
12. Funk, P., Lewien, A., Snelting, G.: Algorithms for concept lattice decomposition
    and their applications. Tech. rep., TU Braunschweig (December 1995)
13. Ganter, B., Wille, R.: Formal concept analysis - mathematical foundations.
    Springer (1999)
14. Geyer, W.: The generalized doubling construction and formal concept analysis.
    algebra universalis 32(3), 341–367 (1994)
15. Grätzer, G.: General lattice theory. Birkhäuser-Verlag, Basel (1978)
16. Mihók, P., Semanišin, G.: Unique factorization theorem and formal concept anal-
    ysis. In: Yahia, S., Nguifo, E., Belohlavek, R. (eds.) Concept Lattices and Their
    Applications, Lecture Notes in Computer Science, vol. 4923, pp. 232–239. Springer
    Berlin Heidelberg (2008)
17. Nation, J.: Alan Day's doubling construction. algebra universalis 34(1), 24–34
    (1995)
18. Snelting, G.: Concept lattices in software analysis. In: Formal Concept Analysis,
    Foundations and Applications. pp. 272–287 (2005)
19. Valtchev, P., Duquenne, V.: On the merge of factor canonical bases. In: Inter-
    national Conference on Formal Concept Analysis ICFCA, pp. 182–198. Springer
    Berlin Heidelberg (2008)
20. Visani, M., Bertet, K., Ogier, J.M.: Navigala: an Original Symbol Classifier Based
    on Navigation through a Galois Lattice. International Journal on Pattern Recog-
    nition and Artificial Intelligence (IJPRAI) (2011)
21. Wille, R.: Subdirekte Produkte und konjunkte Summen. Journal für die reine und
    angewandte Mathematik 239/240, 333–338 (1969)
22. Wille, R.: Subdirekte Produkte vollständiger Verbände. J. reine angew. Math.
    283/284, 53–70 (1976)
23. Wille, R.: Subdirect decomposition of concept lattices. Algebra Universalis 17,
    275–287 (1983)
24. Wille, R.: Subdirect product construction of concept lattices. Discrete Mathematics
    63(2-3), 305–313 (1987)
      Context-Dependent Deductive and Inductive
      Reasoning Based on Good Diagnostic Tests

                      Xenia Naidenova1 and Vladimir Parkhomenko2
                  1
                   Military Medical Academy, Saint-Petersburg, Russia
                                  ksennaidd@gmail.com
      2
        Peter the Great St. Petersburg Polytechnic University, St. Petersburg, Russia
                                parhomenko.v@gmail.com



         Abstract. A sketch of classification reasoning is given in the paper.
         The key ideas of the reasoning are ideas of classification and its good
         approximations based on good diagnostic tests. We consider good tests
         that are maximally redundant (GMRTs), i.e. whose sets of attribute
         values are closed. Classification reasoning embraces two interrelated
         processes: inductively inferring implicative assertions based on GMRTs
         and using these assertions in deductive steps of reasoning. A peculiarity
         of our inductive model of classification reasoning is the control and
         transformation of subcontexts during the inductive phase of reasoning by
         the use of deductive reasoning rules, for example rules prohibiting some
         pairs of attribute values.

         Keywords: good diagnostic test, classification reasoning, essential at-
         tributes, essential objects, implications, subcontexts, deduction, induc-
         tion, closed sets


  1    Introduction
  Classification reasoning is a part of commonsense (plausible) reasoning based
  on two kinds of logical rules extracted from observable datasets (functional and
implicative dependencies). The principal concepts of this reasoning are the con-
  cepts of classification and its good approximations. We assume that objects are
  described by a set U of symbolic or numeric attributes. To give a target clas-
  sification of objects, we use an additional attribute KL not belonging to U . A
  target attribute partitions a given set of objects into disjoint classes the number
  of which is equal to the number of values of this attribute. In Tab. 1, we have
  two classes: KL+ (positive objects) and KL− (negative objects).
   We concentrate on the case of object classification into two classes and the task of
  concept formation [2,5]. The goal of this task is to describe/classify new objects
  according to description/classification of existing objects. Inferring good diag-
  nostic tests (GDTs) is the formation of the best descriptions of a given object
  class KL+ against the objects not belonging to this class (KL− ).
   Let M = ∪{dom(attr) | attr ∈ U }, where dom(attr) is the set of all values
  of attr. Let X ⊆ M and G be the set of indices of objects considered (objects


                   Table 1. Motivating Example of classification


                    No Height Color of Hair Color of Eyes KL
                    1   Low    Blond         Blue         KL+
                    2   Low    Brown         Blue         KL−
                    3   Tall   Brown         Hazel        KL−
                    4   Tall   Blond         Hazel        KL−
                    5   Tall   Brown         Blue         KL−
                    6   Low    Blond         Hazel        KL−
                    7   Tall   Red           Blue         KL+
                    8   Tall   Blond         Blue         KL+




for short), G = G+ ∪ G− , where G+ and G− are the sets of positive and negative
objects, respectively. Denote by d(g) the description of object g ∈ G. Let P(X) =
{g | g ∈ G, X ⊆ d(g)}. We call P(X) the interpretation of X in the power set
2G . If P(X) contains only positive objects and the number of these objects is more
than 2, then we call X a description of some positive objects and (P(X), X) a
test for positive objects.
    Let us define a good test or good description of objects.
Definition 1. A set X ⊆ M of attribute values is a good description of positive
(negative) objects if and only if it is the description of these objects and no
subset Y ⊆ M exists such that P(X) ⊂ P(Y ) ⊆ G+ .
It can be seen [15,11,1] that this problem reduces to searching for causal
dependencies of the form X → v, where X ⊆ M and v is a value of the attribute
KL, for all positive (negative) objects.
    Sec.2 describes an approach to classification reasoning based on implications.
Sec.3 gives the key definitions of Good Test Analysis. Sec.4 presents the de-
composition of inferring good tests into two kinds of subtasks. An approach to
selecting and ordering subcontexts is given in Subsec.5.1. Some situations of re-
ducing the classification context are considered in Subsec.5.2. A set of examples
illustrates all the considerations.

2    Classification Reasoning
Classification reasoning based on GDTs embraces two interrelated processes:
inductive inferring implicative assertions and using them for pattern recognition
problems. We need the following three types of logical rules in order to realize
classification reasoning (deductive and inductive):
 – INSTANCES or relationships between objects or facts actually observed. An
   instance can be considered as a logical rule with the least degree of general-
   ization. On the other hand, instances can serve as a source of training sets of
   positive and negative examples for the inductive inference of generalized rules.


 – RULES OF THE FIRST TYPE or logical assertions. These rules describe
   regular relationships between objects and their properties, between proper-
   ties of different objects, between objects and their classes. The rules of the
   first type can be given explicitly by an expert or derived automatically from
   instances with the help of some learning process. These rules are represented,
   in the paper, in the form of implications.
 – RULES OF THE SECOND TYPE or commonsense reasoning rules with
   the help of which rules of the first type are used, updated, and inferred from
   data (instances). The rules of the second type embrace both inductive and
   deductive reasoning rules.

    The rules of the first type can be represented with the use of only one class
of logical statements based on implicative dependencies between names. Names
are used for designating concepts, things, events, situations, or any evidences.
They can be considered as attributes and attributes’ values in the formal repre-
sentations of logical rules.

1. Implication: a, b, c → d. This rule means that if the values standing on the
   left side of rule are simultaneously true, then the value on the right side of
   rule is always true.
2. Interdiction or forbidden rule (a special case of implication): a, b, c → false
   (never). This rule interdicts a combination of values enumerated on the
   left side of rule. The rule of interdiction can be transformed into several
   implications such as a, b →not c; a, c →not b; b, c →not a.
3. Compatibility: a, b, c → VA, where VA is the frequency of occurrence of the rule.
   The compatibility is equivalent to the following implications: a, b → c, VA;
   a, c → b, VA; b, c → a, VA.
   Generally, the compatibility rule represents the most common combination
   of values, which is characterized by an insignificant number of exceptions
   (contrary examples) from regularity.
4. Diagnostic rule: x, d → a; x, b →not a; d, b → false. For example, d and b
   can be two values of the same attribute. This rule works when the truth of
   ’x’ has been proven and it is necessary to determine whether ’a’ is true or
   not. If ’x&d’ is true, then ’a’ is true; if ’x&b’ is true, then ’a’ is false.
5. Rule of alternatives: a or b → true (always); a, b → false. This rule says
   that a and b cannot be simultaneously true, either a or b can be true but
   not both. This rule is a variant of interdiction.

   Deductive steps of classification reasoning consist of inferring consequences
from some observed facts with the use of implications (i.e. knowledge). For this
goal, deductive rules of reasoning are applied, the main forms of which are modus
ponens, modus tollens, modus ponendo tollens, and modus tollendo ponens. Let
x be a set of true values of some attributes observed simultaneously. Deductive
reasoning rules of the second type are:

1. Using implication: Let r be an implication, left(r) be the left part of r and
   right(r) be the right part of r. If left(r) ⊆ x, then x can be extended by


    right(r) : x ← x ∪ right(r). Using implication is based on modus ponens: if
    A, then B; A; hence B.
 2. Using interdiction: Let r be an implication y →not k. If left(r) ⊆ x, then k
    is the forbidden value for all extensions of x. Using interdiction is based on
    modus ponendo tollens: either A or B (A, B — alternatives); A; hence not
    B; either A or B; B; hence not A.
 3. Using compatibility: Let r = ’a, b, c → k, VA’. If left(r) ⊆ x, then k can be
    used to extend x along with the calculated value VA for this extension (a
    special consideration is required for calculating VA).
 4. Using diagnostic rules: Let r be a diagnostic rule such as ’x, d → a; x, b →not
    a’, where ’x’ is true, and ’a’, ’not a’ are hypotheses or possible values of some
    attribute. Using the diagnostic rule is based on modus ponens and modus
    ponendo tollens. There are several ways for refuting one of the hypotheses:
      – To infer either d or b by using existing knowledge;
      – To involve new known facts (extended context) and/or propositions for
         inferring (with the use of inductive reasoning rules of the second type)
         new implications for distinguishing between the hypotheses ’a’ and ’not
         a’; to apply the newly obtained rules;
      – To get the direct answer on whether d or b is true by involving an external
         source of knowledge.
 5. Using rule of alternatives: Let ’a’ and ’b’ be two alternatives (mutually ex-
    clusive) hypotheses about the value of some attribute, and the truth of one
    of hypotheses has been established, then the second hypothesis is rejected.
    Using the rule is based on modus tollendo ponens: either A or B (A, B —
    alternatives); not A; hence B; either A or B; not B; hence A.
     We can call the rules listed above the rules of “forward inference”. However,
     natural classification reasoning also involves so-called “backward inference”.
 6. Generating hypothesis or abduction rule. Let r be an implication y → k. We
    have “if k is true, then y may be true”.
 7. Using modus tollens. Let r be an implication y → k. If ’not k’ is inferred,
    then ’not y’ is also inferred.

    When applied, the rules given above generate reasoning which is not demon-
strative. Its purpose is to infer all possible hypotheses about the value of some
target attribute. It is essential that hypotheses do not contradict the knowledge
(first-type rules) and the observable real situation in which the reasoning takes
place (contextual knowledge). Inference of hypotheses is reduced to constructing
all intrinsically consistent extensions of a set of values in which the number of
involved attributes is as large as possible (there are no prohibited pairs of values
in such extensions). All hypotheses have different admissibility determined by
the quantity and “quality” of the compatibility rules involved in the reasoning.
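    To make the first two of these reasoning rules concrete, the following Python
sketch applies implications (modus ponens) and interdictions to a set of observed
values. It is only an illustration under hypothetical rules and value names, not
the authors' implementation.

# Sketch of "using implication" and "using interdiction" (rules 1 and 2 above).
# The rules and value names below are hypothetical; illustration only.
def extend(observed, implications, interdictions):
    values, forbidden = set(observed), set()
    changed = True
    while changed:
        changed = False
        for left, right in implications:            # rule r: left -> right
            if left <= values and right not in values and right not in forbidden:
                values.add(right)                   # x <- x U right(r)
                changed = True
        for left, banned in interdictions:          # rule r: left -> not banned
            if left <= values and banned not in forbidden:
                forbidden.add(banned)               # banned for all extensions of x
                changed = True
    return values, forbidden

implications = [({"a", "b"}, "c"), ({"c", "d"}, "k")]   # a, b -> c;  c, d -> k
interdictions = [({"a", "d"}, "e")]                     # a, d -> not e
print(extend({"a", "b", "d"}, implications, interdictions))
# ({'a', 'b', 'c', 'd', 'k'}, {'e'})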
    A very simple structure of knowledge base (KB) and an example of applying
the rules of second type for a pattern recognition problem can be found in [19].
The process of deductive reasoning can require inferring new rules of the first
type from data when i) the result of reasoning contains several hypotheses and


it is impossible to choose one and only one of them (uncertainty), and ii) it is
impossible to infer any hypothesis.
     Inductive steps of classification reasoning consist in using already known
facts, observations and experimental results for inferring implications and for
correcting them. The main forms of induction were formulated by J. Stuart Mill
[12]. The formalization of these forms of induction has been offered by V.K. Finn
[6]. For inferring GDTs, the Combined Similarity and Difference Method of
Mill is used and realized by means of inductive inference rules of the second
type proposed in [13].
     The deductive rules of the second type described above also work during the
inductive inference of implications, since the implications obtained are used
immediately to reduce the search space. In this paper, we illustrate the use of the
forbidden and diagnostic rules of the second type for inferring GDTs.

3   Key Notions of Good Test Analysis
Assume that G = {1, . . . , N } is the set of objects and M = {m1 , m2 , . . . , mj , . . . , mm }
is the set of values. The object descriptions are represented by the rows of a table
whose columns are associated with the attributes taking their values in M
(see Tab. 1).
    Denote by Bg ⊆ M the description of an object g ∈ G. The Galois
connection [21] between the ordered sets (2G , ⊆) and (2M , ⊆), i.e. 2G → 2M and
2M → 2G , is defined by the following mappings, called derivation operators (see
also, under different notations, [16,9]): for A ⊆ G and B ⊆ M , A′ = val(A) =
∩{Bg | g ∈ A} and B ′ = obj(B) = {g | g ∈ G, B ⊆ Bg }. We consider obj(B) =
∩{obj(m) | m ∈ B}. In Formal Concept Analysis, for g ∈ G and m ∈ M , {g}′
(or simply g ′ ) is called the object intent, and {m}′ (or simply m′ ) is called the
attribute extent [7].
    There are two closure operators [16]: generalization of(B) = B ′′ = val(obj(B))
and generalization of(A) = A′′ = obj(val(A)). A set A is closed if A = obj(val(A)),
and a set B is closed if B = val(obj(B)). If A′ = B and B ′ = A, then the pair (A, B)
is called a formal concept, whose components A and B are called the concept extent
and intent, respectively.
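    The derivation operators and the closure operators can be written down directly.
The following Python sketch (an illustration only, not code from the paper) applies
them to the motivating example of Tab. 1, where D[g] stands for the description
Bg , and recovers the formal concept ({1, 8}, {Blond, Blue}).

# Derivation operators val and obj for the context of Tab. 1; illustration only.
D = {1: {"Low", "Blond", "Blue"},   2: {"Low", "Brown", "Blue"},
     3: {"Tall", "Brown", "Hazel"}, 4: {"Tall", "Blond", "Hazel"},
     5: {"Tall", "Brown", "Blue"},  6: {"Low", "Blond", "Hazel"},
     7: {"Tall", "Red", "Blue"},    8: {"Tall", "Blond", "Blue"}}

def val(A):                               # A' = intersection of all Bg, g in A
    return set.intersection(*(D[g] for g in A)) if A else {m for B in D.values() for m in B}

def obj(B):                               # B' = {g | B is a subset of Bg}
    return {g for g, Bg in D.items() if set(B) <= Bg}

print(obj({"Blond", "Blue"}))             # {1, 8}
print(val({1, 8}))                        # {'Blond', 'Blue'}
print(obj(val({1, 8})) == {1, 8})         # True: ({1, 8}, {Blond, Blue}) is a formal concept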
    A triple (G, M, I), where I ⊆ G × M is a binary relation, is called a formal
context K. Let Kε := (Gε , M, Iε ) and Iε := I ∩ (Gε × M ), where ε ∈ {+, −} (one
can add the value τ if there is an undefined object) [8]. Then K± := (G+ ∪ G− , M ∪
{KL}, I+ ∪ I− ∪ (G± × {KL})).
Definition 2. A Diagnostic (classification) Test (DT) for G+ is a pair (A, B)
such that B ⊆ M , A = obj(B) ≠ ∅, A ⊆ G+ , and B ⊈ val(g), ∀g ∈ G− .
Equivalently, obj(B) ∩ G− = ∅.
Definition 3. A DT (A, B) for G+ is maximally redundant if obj(B ∪ m) ⊂ A
for all m ∉ B, m ∈ M .

Definition 4. A DT (A, B) for G+ is irredundant if any narrowing B∗ =
B \ m, m ∈ B implies that (obj(B∗ ), B∗ ) is not a test for G+ .


Definition 5. A DT (A, B) for G+ is good if and only if any extension A∗ =
A ∪ i, i ∉ A, i ∈ G+ implies that (A∗ , val(A∗ )) is not a test for G+ .

    If a good test (A, B) for G+ is maximally redundant (GMRT), then any
extension B∗ = B ∪ m, m ∉ B, m ∈ M implies that (obj(B∗ ), B∗ ) is not a good
test for G+ . The relation of GMRTs to Formal Concept Analysis is considered
in [17].
    Any object description d(g), g ∈ G in a given classification context is a maxi-
mally redundant set of values because, ∀m ∉ d(g), m ∈ M , obj(d(g) ∪ m) is equal
to ∅.
    In Tab.1, ((1, 8), Blond Blue) is a GMRT for KL+ , ((4, 6), Blond Hazel) is
a maximally redundant test for KL− but not a good one, and ((3, 4, 6), Hazel)
is a good maximally redundant and, simultaneously, good irredundant test for
KL− .


4    Decomposition of Inferring GMRTs into Subtasks

There are two kinds of subtasks for inferring GMRTs:

 1. given a set of values B, where B ⊆ M, obj(B) ≠ ∅, B ⊈ d(g), ∀g ∈ G− , find
    all GMRTs (obj(B∗ ), B∗ ) such that B∗ ⊂ B;
 2. given a non-empty set of values X ⊆ M such that (obj(X), X) is not a test
    for positive objects, find all GMRTs (obj(Y ), Y ) such that X ⊂ Y .

    For solving these subtasks, we need to form subcontexts of a given classifica-
tion context K± . In the first case, we have to include in the subcontext all positive
objects whose descriptions intersect B. This subtask is useful for finding all
GMRTs whose intents are contained in the description d(g) of an object g.
This subtask is considered in [5] for fast incremental concept formation. The fol-
lowing notions of object and value projections are developed to form subcontexts
[18].

Definition 6. The projection proj(d(g)), g ∈ G+ on the set D+ , where D+ is
the set of descriptions of all positive objects, is the set {z | z = d(g) ∩ d(g∗ ) ≠
∅, d(g∗ ) ∈ D+ & (obj(z), z) is a test for G+ }. Note that d(g) ∈ proj(d(g)).

Definition 7. The value projection on a given set D+ proj(B) is {d(g) | B ⊆
d(g), d(g) ∈ D+ } or, equivalently, proj(B) = {d(g) | g ∈ (obj(B) ∩ G+ )}.

   The projections define the methods to construct two kinds of subcontexts of
K± . The following theorem gives the foundation of reducing subcontexts [14].

Theorem 1. Let X ⊆ M, (obj(X), X) be a maximally redundant test for pos-
itive objects and obj(m) ⊆ obj(X), m ∈ M . Then m cannot belong to any
GMRT for positive objects different from (obj(X), X).


    Let splus(m) be obj(m) ∩ G+ (respectively, obj(m) ∩ G− ) and SPLUS = {splus(m), m ∈ M }.
    Consider an example of reducing a subcontext based on Tab. 1.
    For values Hazel, Brown, Tall, Blue, Blond, and Low, SPLUS={obj(m)∩G− }
= {{3, 4, 6}, {2, 3, 5}, {3, 4, 5}, {2, 5}, {4, 6}, {2, 6}}.
    We have val(obj(Hazel)) = Hazel, hence ((3,4,6), Hazel) is a test for G− .
Then value Blond can be deleted from consideration, because splus(Blond) ⊂
splus(Hazel).
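    This reduction amounts to an inclusion check over the sets splus(m). The
following Python sketch (illustration only) repeats the computation just described
for the negative class of Tab. 1, using the sets listed above.

# Theorem 1 on the negative class of Tab. 1: once ((3, 4, 6), Hazel) is known to be
# a maximally redundant test, every value m with splus(m) included in splus(Hazel)
# can be deleted from consideration.  Illustration only.
splus = {"Hazel": {3, 4, 6}, "Brown": {2, 3, 5}, "Tall": {3, 4, 5},
         "Blue": {2, 5}, "Blond": {4, 6}, "Low": {2, 6}}
closed_value = "Hazel"                                 # val(obj(Hazel)) = {Hazel}
removable = [m for m in splus
             if m != closed_value and splus[m] <= splus[closed_value]]
print(removable)                                       # ['Blond']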
    It is essential that the projection is a subset of object descriptions defined
on a certain restricted subset t∗ ⊆ M of values. Let s∗ be the subset of objects,
descriptions of which produce the projection. In the projection, splus(m) =
obj(m) ∩ s∗ , m ∈ t∗ .
    Let STGOOD be the partially ordered set of elements s such that (s, val(s))
is a good test for D+ .
    Forming the Set STGOOD. The important part of our basic recursive
algorithm is how to form the set STGOOD. Let L = (2S , ≤) be the set lattice
[22]. The ordering determined in the set lattice coincides with the set-theoretical
inclusion. Subset s1 is absorbed by subset s2 , i.e. s1 ≤ s2 , if and only if s1 ⊆ s2 .
During the formation of STGOOD, a subset s ⊆ G+ is stored in STGOOD if and
only if it is not absorbed by any element of this set. It is also necessary to
delete from STGOOD all the elements that are absorbed by s when s is stored in
STGOOD. Thus, when the algorithm is over, the set STGOOD contains all the
collections of objects that are the extents of GMRTs and only such collections.
The set TGOOD of all the GMRTs is obtained as follows: TGOOD = {tg| tg =
(s, val(s)), s ∈ STGOOD}.
    The basic recursive procedure for solving any kind of subtasks has been
described in [20].
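    The maintenance of STGOOD is thus an antichain update under set inclusion.
The insertion step can be sketched in Python as follows (illustration only, not the
authors' implementation); the example run uses extents that also occur in the
illustrative example of Sect. 5, where (4, 12) is later absorbed by (4, 7, 12).

# Storing an extent s in STGOOD: s is kept only if it is not absorbed by an
# element already present, and the elements absorbed by s are removed.
def store(stgood, s):
    s = frozenset(s)
    if any(s <= t for t in stgood):          # s is absorbed, nothing changes
        return stgood
    return [t for t in stgood if not t <= s] + [s]

stgood = []
for s in [{2, 10}, {3, 10}, {4, 12}, {1, 4, 7}, {4, 7, 12}]:
    stgood = store(stgood, s)
print([set(t) for t in stgood])
# [{2, 10}, {3, 10}, {1, 4, 7}, {4, 7, 12}]  -- {4, 12} was absorbed by {4, 7, 12}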


5    Selecting and Ordering Subcontexts and Inferring
     GMRTs
Algorithms for inferring GMRTs are built around rules for selecting and
ordering subcontexts of the main classification context. Before entering into the
details, we give some new definitions.

Definition 8. Let t be a set of values such that (obj(t), t) is a test for G+ . The
value m ∈ M, m ∈ t is essential in t if (obj(t \ m), (t \ m)) is not a test for a
given set of objects.

Generally, we are interested in finding the maximal subset sbmax(t) ⊂ t such
that (obj(t), t) is a test but (obj(sbmax(t)), sbmax(t)) is not a test for a given
set of positive objects. Then sbmin(t) = t \ sbmax(t) is a minimal set of essential
values in t.

Definition 9. Let s ⊆ G+ , and assume also that (s, val(s)) is not a test. The object
g, g ∈ s is said to be essential in s if (s \ g, val(s \ g)) proves to be a test for
a given set of positive objects.


    Generally, we are interested in finding the maximal subset sbmax(s) ⊂ s
such that (s, val(s)) is not a test but (sbmax(s), val(sbmax(s))) is a test for a
given set of positive objects. Then sbmin(s) = s \ sbmax(s) is a minimal set of
essential objects in s.
    Finding quasi-maximal (minimal) subsets of objects and values is the key
procedure behind searching for the initial content of STGOOD and determining the
number of subtasks to be solved.


5.1   Using the Diagnostic Rule of the Second Type

An Approach for Searching for Initial Content of STGOOD. In the
beginning of inferring GMRTs, the set STGOOD is empty. Next we describe
the procedure for obtaining its initial content. This procedure extracts a quasi-
maximal subset s∗ ⊆ G+ which is the extent of a test for G+ (maybe not a
good one).
    We begin with the first object g1 of s∗ , then we take the next object g2
of s∗ and evaluate the function to be test({g1 , g2 }, val({g1 , g2 })). If the value
of the function is true, then we take the next object g3 of s∗ and evaluate
the function to be test({g1 , g2 , g3 }, val({g1 , g2 , g3 })). If the value of the function
to be test({g1 , g2 }, val({g1 , g2 })) is false, then the object g2 of s∗ is skipped
and the function to be test({g1 , g3 }, val({g1 , g3 })) is evaluated. We continue this
process until we achieve the last object of s∗ . This procedure can be considered
as a realization of the diagnostic rule of the second type.
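    A Python sketch of this greedy pass is given below (illustration only). Here
to be test stands for the test check of Definition 2, instantiated on the small
context of Tab. 1; for s∗ = (1, 7, 8) the pass keeps objects 1 and 8 and skips
object 7, since ({1, 7}, val({1, 7})) is not a test.

# One greedy pass over s_star: an object is kept only if the collection gathered
# so far still yields a test for G+.  Illustration only.
def initial_extent(s_star, to_be_test):
    current = []
    for g in s_star:
        if to_be_test(current + [g]):
            current.append(g)                # keep g
        # otherwise g is skipped and the next object of s_star is tried
    return current

# Instantiation on Tab. 1 (G+ = {1, 7, 8}, G- = {2, 3, 4, 5, 6}):
D = {1: {"Low", "Blond", "Blue"},   2: {"Low", "Brown", "Blue"},
     3: {"Tall", "Brown", "Hazel"}, 4: {"Tall", "Blond", "Hazel"},
     5: {"Tall", "Brown", "Blue"},  6: {"Low", "Blond", "Hazel"},
     7: {"Tall", "Red", "Blue"},    8: {"Tall", "Blond", "Blue"}}
G_minus = {2, 3, 4, 5, 6}

def to_be_test(A):                           # is (A, val(A)) a test for G+?
    common = set.intersection(*(D[g] for g in A))
    return not ({g for g, Bg in D.items() if common <= Bg} & G_minus)

print(initial_extent([1, 7, 8], to_be_test))  # [1, 8]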
    To illustrate this procedure, we use the sets D+ and D− represented in Tab.2
and 3 (our illustrative example). In these tables M = {m1 , . . . , m26 }. The set
SPLUS0 for positive class of objects is in Tab.4. The initial content of STGOOD0
is {(2,10), (3, 10), (3, 8), (4, 12), (1, 4, 7), (1, 5,12), (2, 7, 8), (3, 7, 12), (1, 2,
12, 14), (2, 3, 4, 7), (4, 6, 8, 11)}.
    In what follows, we denote subsets of values {m8 , m9 }, {m14 , m15 } by ma and
mb , respectively. Applying the operation generalization of(s) = s′′ = obj(val(s)) to
every s ∈ STGOOD, we obtain STGOOD1 = {(2,10), (3, 10), (3, 8), (4, 7, 12), (1, 4,
7), (1, 5,12), (2, 7, 8), (3, 7, 12), (1, 2, 12, 14), (2, 3, 4, 7), (4, 6, 8, 11)}.
    By Th. 1, we can delete value m12 from consideration (see splus(m12 ) in Tab. 4).
The initial content of STGOOD makes it possible to decrease the number of calls to the
procedure to be test() and the number of insertions of test extents into STGOOD.
    The number of subtasks to be solved. This number is determined by the
number of essential values in the set M . The quasi-minimal subset of essential
values in M can be found by a procedure analogous to the procedure applicable
to search for the initial content of STGOOD.
    As a result of this procedure, we obtain a quasi-maximal subset sbmax(M ) of
M such that (obj(sbmax(M )), sbmax(M )) is not a test for positive examples.
Then the subset LEV = M \ sbmax(M ) is a quasi-minimal subset of essential values in
M . We have the following LEV : {m16 , m18 , m19 , m20 , m21 , m22 , m23 , m24 , m26 }.
Note that searching for the number of subtasks to be solved is also based on
using the diagnostic rule of the second type.



              Table 2. The set D+ of positive object descriptions


          G D+
          1 m1 m2 m5 m6 m21 m23 m24 m26
          2 m4 m7 m8 m9 m12 m14 m15 m22 m23 m24 m26
          3 m3 m4 m7 m12 m13 m14 m15 m18 m19 m24 m26
          4 m1 m4 m5 m6 m7 m12 m14 m15 m16 m20 m21 m24 m26
          5 m2 m6 m23 m24
          6 m7 m20 m21 m26
          7 m3 m4 m5 m6 m12 m14 m15 m20 m22 m24 m26
          8 m3 m6 m7 m8 m9 m13 m14 m15 m19 m20 m21 m22
          9 m16 m18 m19 m20 m21 m22 m26
          10 m2 m3 m4 m5 m6 m8 m9 m13 m18 m20 m21 m26
          11 m1 m2 m3 m7 m19 m20 m21 m22 m26
          12 m2 m3 m16 m20 m21 m23 m24 m26
          13 m1 m4 m18 m19 m23 m26
          14 m23 m24 m26




             Table 3. The set D− of negative object descriptions


G D−                           G D−
15 m3 m8 m16 m23 m24           32 m1 m2 m3 m7 m9 m13 m18
16 m7 m8 m9 m16 m18            33 m1 m5 m6 m8 m9 m19 m20 m22
17 m1 m21 m22 m24 m26          34 m2 m8 m9 m18 m20 m21 m22 m23 m26
18 m1 m7 m8 m9 m13 m16         35 m1 m2 m4 m5 m6 m7 m9 m13 m16
19 m2 m6 m7 m9 m21 m23         36 m1 m2 m6 m7 m8 m13 m16 m18
20 m19 m20 m21 m22 m24         37 m1 m2 m3 m4 m5 m6 m7 m12 m14 m15 m16
21 m1 m20 m21 m22 m23 m24      38 m1 m2 m3 m4 m5 m6 m9 m12 m13 m16
22 m1 m3 m6 m7 m9 m16          39 m1 m2 m3 m4 m5 m6 m14 m15 m19 m20 m23 m26
23 m2 m6 m8 m9 m14 m15 m16     40 m2 m3 m4 m5 m6 m7 m12 m13 m14 m15 m16
24 m1 m4 m5 m6 m7 m8 m16       41 m2 m3 m4 m5 m6 m7 m9 m12 m13 m14 m15 m19
25 m7 m13 m19 m20 m22 m26      42 m1 m2 m3 m4 m5 m6 m12 m16 m18 m19 m20 m21 m26
26 m1 m2 m3 m5 m6 m7 m16       43 m4 m5 m6 m7 m8 m9 m12 m13 m14 m15 m16
27 m1 m2 m3 m5 m6 m13 m18      44 m3 m4 m5 m6 m8 m9 m12 m13 m14 m15 m18 m19
28 m1 m3 m7 m13 m19 m21        45 m1 m2 m3 m4 m5 m6 m7 m8 m9 m12 m13 m14 m15
29 m1 m4 m5 m6 m7 m8 m13 m16 46 m1 m3 m4 m5 m6 m7 m12 m13 m14 m15 m16 m23 m24
30 m1 m2 m3 m6 m12 m14 m15 m16 47 m1 m2 m3 m4 m5 m6 m8 m9 m12 m14 m16 m18 m22
31 m1 m2 m5 m6 m14 m15 m16 m26 48 m2 m8 m9 m12 m14 m15 m16

                                  Table 4. The set SPLUS0


      splus(m), m ∈ M                  splus(m), m ∈ M
      splus(ma ) → {2, 8, 10}         splus(m22 ) → {2, 7, 8, 9, 11}
      splus(m13 ) → {3, 8, 10}        splus(m23 ) → {1, 2, 5, 12, 13, 14}
      splus(m16 ) → {4, 9, 12}        splus(m3 ) → {3, 7, 8, 10, 11, 12}
      splus(m1 ) → {1, 4, 11, 13}     splus(m4 ) → {2, 3, 4, 7, 10, 13}
      splus(m5 ) → {1, 4, 7, 10}      splus(m6 ) → {1, 4, 5, 7, 8, 10}
      splus(m12 ) → {2, 3, 4, 7}      splus(m7 ) → {2, 3, 4, 6, 8, 11}
      splus(m18 ) → {3, 9, 10, 13}    splus(m24 ) → {1, 2, 3, 4, 5, 7, 12, 14}
      splus(m2 ) → {1, 5, 10, 11, 12} splus(m20 ) → {4, 6, 7, 8, 9, 10, 11, 12}
      splus(mb ) → {2, 3, 4, 7, 8}    splus(m21 ) → {1, 4, 6, 8, 9, 10, 11, 12}
      splus(m19 ) → {3, 8, 9, 11, 13} splus(m26 ) → {1, 2, 3, 4, 6, 7, 9, 10, 11, 12, 13, 14}




Proposition 1. Each essential value is included at least in one positive object
description.

Proof. Assume that for an object description tg , g ∈ G+ , we have tg ∩ LEV = ∅.
Then tg ⊆ M \ LEV. But M \ LEV is included in at least one of the negative
object descriptions and, consequently, tg also possesses this property. This
contradicts the fact that tg is the description of a positive object.      ⊓⊔

Proposition 2. Assume that X ⊆ M . If X ∩ LEV = ∅, then to be test(obj(X),
X) = f alse.

Proposition 3. Let K± be a given classification context. For finding all GMRTs
contained in K± , it is sufficient to solve this problem only for subcontexts asso-
ciated with essential values in M (the set LEV).

    Proposition 2 is a consequence of Proposition 1. The proof of Proposition 3
is obvious.
    Note that the description t14 = {m23 , m24 , m26 } of object 14 is closed because
obj({m23 , m24 , m26 }) = {1, 2, 12, 14} and val({1, 2, 12, 14}) = {m23 , m24 , m26 }. We
also know that s = {1, 2, 12, 14} is closed too (we obtained this result during
generalization of elements of STGOOD). So (obj({m23 , m24 , m26 }), {m23 , m24 ,
m26 }) is a maximally redundant test for positive objects and we can, conse-
quently, delete t14 from consideration. As a result of deleting m12 and t14 , we
have the modified set SPLUS (Tab.5).

5.2     Using the Rule of Alternative for Splitting Subcontexts
Essentially, searching for GMRTs turns out to be a process of generating deductive
reasoning rules of the first type, which are used immediately during this process.
We now give an example of constructing and using the rules of alternatives
(forbidden rules).

                                Table 5. The set SPLUS1


      splus(m), m ∈ M                  splus(m), m ∈ M
      splus(ma ) → {2, 8, 10}         splus(m22 ) → {2, 7, 8, 9, 11}
      splus(m13 ) → {3, 8, 10}        splus(m23 ) → {1, 2, 5, 12, 13}
      splus(m16 ) → {4, 9, 12}        splus(m3 ) → {3, 7, 8, 10, 11, 12}
      splus(m1 ) → {1, 4, 11, 13}     splus(m4 ) → {2, 3, 4, 7, 10, 13}
      splus(m5 ) → {1, 4, 7, 10}      splus(m6 ) → {1, 4, 5, 7, 8, 10}
                                      splus(m7 ) → {2, 3, 4, 6, 8, 11}
      splus(m18 ) → {3, 9, 10, 13}    splus(m24 ) → {1, 2, 3, 4, 5, 7, 12}
      splus(m2 ) → {1, 5, 10, 11, 12} splus(m20 ) → {4, 6, 7, 8, 9, 10, 11, 12}
      splus(mb ) → {2, 3, 4, 7, 8}    splus(m21 ) → {1, 4, 6, 8, 9, 10, 11, 12}
      splus(m19 ) → {3, 8, 9, 11, 13} splus(m26 ) → {1, 2, 3, 4, 6, 7, 9, 10, 11, 12, 13}




     Form all the pairs of objects (i, j), i, j ∈ G+ ; form the set FR of forbidden pairs of
objects, FR = {(i, j) : to be test((i, j), val(i, j)) is false} (Tab. 6); then the set
of admissible pairs is ADM = {(i, j) : to be test((i, j), val(i, j)) is true} (Tab.
6).
     Each forbidden pair (i, j) generates a rule of alternatives: i OR j → true (always);
i, j → false. This rule says that i and j cannot simultaneously enter any ex-
tent of a GMRT: either i or j can occur, but not both.
     These rules of alternatives allow us to split any splus(m), m ∈ M into subsets
not containing any forbidden pair of objects. An example is given in Tab. 7.
     As a result of splitting, we obtain a splitting tree whose leaves are subsets
of objects not containing any prohibited object pair. The subcontexts for GMRT
inferring are the leaves of the tree with at least 4 objects. So, in Tab. 7, we have
only one subcontext (4,7,8,11) for searching for tests. As for triplets of objects,
if a triplet is not the extent of a test, then we already have all the information
about the pairs of objects contained in it.
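    The splitting itself is a simple recursion: as long as a leaf still contains a
forbidden pair (i, j), it is replaced by the two subsets obtained by removing i
and by removing j. The Python sketch below (illustration only) reproduces the
splitting shown in Tab. 7.

# Splitting a set of objects by the rules of alternatives (forbidden pairs);
# the pairs used here are the ones relevant to the example of Tab. 7.
def split(s, forbidden):
    s = frozenset(s)
    for i, j in forbidden:
        if i in s and j in s:                # i and j cannot both be kept
            return split(s - {i}, forbidden) | split(s - {j}, forbidden)
    return {s}                               # leaf: no forbidden pair left

leaves = split({4, 7, 8, 11, 12}, [(8, 12), (11, 12)])
print(sorted(sorted(l) for l in leaves))
# [[4, 7, 8, 11], [4, 7, 11], [4, 7, 12]]; only {4, 7, 8, 11} has at least four
# objects and therefore yields a subcontext for GMRT inferring.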
     An analysis of the set ADM leads to the use of some inductive reasoning rules [19].
If an element j enters sq+1 , then it must enter q proper subsets of sq+1 , where
sq+1 is a set containing (q + 1) elements. If we observe that j enters only one
doublet (pair), then it cannot enter any triplet. If we observe that j enters
only one triplet, then it cannot enter any quadruplet, and so on. If an element
enters two and only two doublets, it can enter only one triplet. If an element
enters three and only three doublets, it can enter only one quadruplet.
     This inductive reasoning is applicable to constructing triplets from doublets,
quadruplets from triplets, and so on. In our example: object 9 enters one and
only one doublet, hence (9,11) cannot be included in any triplet. We conclude
that (9,11) is the extent of a GMRT. Object 5 enters two and only two pairs,
hence it is included in only one triplet, (1,5,12), already contained in the initial
content of STGOOD. Object 5 cannot be included in any quadruplet. Object 6




                           Table 6. Classification of object pairs


 g   Admissible pairs of objects                       Forbidden pairs of objects
                                                      (1,3) (1,6) (1,8) (1,9) (1,10)
 1   (1,2) (1,4) (1,5) (1,7) (1,12)
                                                      (1,11) (1,13)
 2 (2,1) (2,3) (2,4) (2,7) (2,8) (2,10) (2,12)        (2,5) (2,6) (2,9) (2,11) (2,13)
 3 (3,2) (3,4) (3,7) (3,8) (3,10) (3,11) (3,12)       (3,1) (3,5) (3,6) (3,9) (3,13)
 4 (4,1) (4,2) (4,3) (4, 6) (4,7) (4,8) (4,11) (4,12) (4,5) (4,9) (4,10) (4,13)
                                                      (5,2) (5,3) (5,4) (5,6) (5,7)
 5 (5,1) (5,12)
                                                      (5,8) (5,9) (5,10) (5,11) (5,13)
                                                      (6,1) (6,2) (6,3) (6,5) (6,7)
 6 (6,4) (6,8) (6,11)
                                                      (6,9) (6,10) (6,12) (6,13)
 7 (7,1) (7,2) (7,3) (7,4) (7,8) (7,11) (7,12)        (7,5) (7, 6) (7,9) (7,10) (7,13)
 8 (8,2)(8,3)(8,4) (8,6)(8,7) (8,10) (8,11)           (8,1) (8,5) (8,9) (8,12) (8,13)
 9 (9,11)                                             All the other pairs
                                                      (10,1) (10,4) (5,10) (10,6) (10,7)
 10 (10,2) (10,3) (10,8)
                                                      (10,9) (10,11) (10,12) (10,13)
                                                      (11,1) (11,2) (5,11) (11,9)
 11 (11,3) (11,4) (6,11) (7,11) (11,9)
                                                      (11, 10) (11, 12) (11,13)
                                                      (12,6) (12,8) (12,9) (12,10)
 12 (12,1) (12,2) (12,3) (12,4) (12,5) (12,7)
                                                      (12,11) (12,13)
 13 ∅                                                 All the pairs




                      Table 7. An example of splitting splus(m2 )


                       No splus(m2 )                Forbidden pair
                       1     (4,7,8,11,12)           (8,12)
                       2     (4,7,8,11), (4,7,11,12)
                       3     (4,7,11,12)             (11,12)
                       4     (4,7,11) (4,7,12)


enters three and only three pairs, hence it is included in only one quadruplet,
(4,6,8,11), already contained in STGOOD. Object 13 is not included in any
admissible pair of objects. We put {9, 11} and {13} into STGOOD and delete
objects 9, 5, 6, and 13 from consideration. This action implies deleting the values
m16 , m18 , and m23 (see Tab. 8), because splus(mi ), where i takes the values
16, 18 or 23, are absorbed by some elements of STGOOD.


                                Table 8. The set SPLUS2


            splus(m), m ∈ M               splus(m), m ∈ M
            splus(ma ) → {2, 8, 10}      splus(m22 ) → {2, 7, 8, 11}
            splus(m13 ) → {3, 8, 10}     splus(m23 ) → {1, 2, 12}
            splus(m16 ) → {4, 12}        splus(m3 ) → {3, 7, 8, 10, 11, 12}
            splus(m1 ) → {1, 4, 11}      splus(m4 ) → {2, 3, 4, 7, 10, 13}
            splus(m5 ) → {1, 4, 7, 10}   splus(m6 ) → {1, 4, 7, 8, 10}
                                         splus(m7 ) → {2, 3, 4, 6, 8, 11}
            splus(m18 ) → {3, 10}        splus(m24 ) → {1, 2, 3, 4, 7, 12}
            splus(m2 ) → {1, 10, 11, 12} splus(m20 ) → {4, 7, 8, 10, 11, 12}
            splus(mb ) → {2, 3, 4, 7, 8} splus(m21 ) → {1, 4, 8, 9, 10, 11, 12}
            splus(m19 ) → {3, 8, 11}     splus(m26 ) → {1, 2, 3, 4, 7, 10, 11, 12}




    We analyze the triplets in SPLUS: splus(ma ) = {2, 8, 10}, splus(m1 ) = {1, 4, 11},
splus(m19 ) = {3, 8, 11}, and splus(m13 ) = {3, 8, 10}. As a result, we obtain
the extents of two new GMRTs, {3, 11} and {8, 10}. After that, we can delete
object 10. Then we can delete the values m5 , m2 , and m4 after analyzing their
modified splus. Now we split splus(m22 ), splus(m20 ), splus(m21 ), splus(m24 ), and
splus(m26 ), because these values are the only remaining essential values in the
set LEV. Tab. 9 shows the number of subtasks in the splitting trees for these
essential values and the newly obtained tests.


                    Table 9. Splitting splus(m), m is essential value


 splus(m)     The number of subtasks New tests Deleted elements
 splus(m22 )                              {7, 8, 11} m26
 splus(m20 ) 1                                       m20
 splus(m21 )                                         m21 , t8
                                                     m24 , t1 , t12 , t7 ; remaining set
 splus(m24 ) 2                                       of indices {2, 3, 4} is absorbed
                                                     by element {2, 3, 4, 7} in STGOOD


    Some words about future investigations. The most important mathematical
tasks worth solving are the following:
1. For an irredundant splus(m), to determine whether it contains extents of new
   GMRTs;
2. Optimization of the selection of forbidden pairs of objects in order not to obtain
   repeated subsets in splitting trees.


6    Conclusion
A sketch of classification reasoning has been given in the paper. The key notions
of the reasoning are the notions of classification and its good approximations based
on good diagnostic tests. Good diagnostic tests have a dual nature: they
generate functional and implicative dependencies and, at the same time, the sets of
objects satisfying these dependencies. The dependencies are logical assertions on
the basis of which the deductive phase of classification reasoning is realized.
    In the paper, we deal only with GMRTs as value implications. An algorithm
for inferring GMRTs from a set of data is given. A decomposition of the main
task of inferring GMRTs into subtasks associated with certain types of sub-
contexts is introduced.
    There are various ways of using the proposed decompositions and the extracted
initial information about GMRTs, but we confine ourselves to the most impor-
tant ideas, allowing us to obtain not only tests (and the corresponding knowledge)
but also, simultaneously, the contexts inside which these tests have been inferred.
Currently, there is considerable interest in contextual information search and
representation for improving web search results or query expansion
[3,4,10]. Classification reasoning can serve as a basic tool for constructing and
using context-dependent knowledge in different problem domains and applica-
tions.


References
 1. Baixeries, J., Kaytoue, M., Napoli, A.: Computing functional dependencies with
    pattern structures. In: Szathmary, L., Priss, U. (eds.) Proceedings of The Ninth In-
    ternational Conference on Concept Lattices and Their Applications. CEUR Work-
    shop Proceedings, vol. 972, pp. 175–186. CEUR-WS.org (2012)
 2. Banerji, R.B.: Theory of problem solving: an approach to artificial intelligence,
    vol. 17. Elsevier Publishing Company (1969)
 3. Bankar, S., Nagpure, R.: Contextual information search based on domain using
    query expansion. In: International Journal of Emerging Trends and Technology in
    Computer Science. vol. 3, pp. 148 – 151 (2014)
 4. Bouramoul, A., Kholladi, M.K., Doan, B.L.: PRESY: A Context Based Query
    Reformulation Tool for Information Retrieval on the Web. Journal of Computer
    Science 6(4), 470–477 (2010)
 5. Ferré, S., Ridoux, O.: The use of associative concepts in the incremental building
    of a logical context. In: Priss, U., Corbett, D., Angelova, G. (eds.) ICCS. Lecture
    Notes in Computer Science, vol. 2393, pp. 299–313. Springer (2002)


 6. Finn, V.: The synthesis of cognitive procedures and the problem of induction. NTI
    ser.2, 1-2, 8–44 (1999), (in Russian)
 7. Ganter, B., Wille, R.: Formal concept analysis: mathematical foundations.
    Springer, Berlin (1999)
 8. Ganter, B., Kuznetsov, S.O.: Formalizing hypotheses with concepts. In: Proceed-
    ings of the Linguistic on Conceptual Structures: Logical Linguistic, and Computa-
    tional Issues. pp. 342–356. Springer-Verlag (2000)
 9. Ganter, B., Kuznetsov, S.: Pattern structures and their projections. In: Delugach,
    H., Stumme, G. (eds.) Conceptual Structures: Broadening the Base, Lecture Notes
    in Computer Science, vol. 2120, pp. 129–142. Springer Berlin Heidelberg (2001)
10. Klimešová, D., Ocelíková, E.: Study on context understanding, knowledge trans-
    formation and decision support systems. WSEAS Trans. Info. Sci. and App. 7(7),
    985–994 (Jul 2010)
11. Kuznetsov, S.: Machine learning on the basis of formal concept analysis. Automa-
    tion and Remote Control 62(10), 1543–1564 (2001)
12. Mill, J.S.: The System of Logic Ratiocinative and Inductive Being a Connected
    View of the Principles of Evidence, and the Methods of Scientific Investigation.
    Longmans (1872)
13. Naidenova, K.: Reducing one class of machine learning algorithms to logical oper-
    ations of plausible reasoning. Journal of Computer and Systems Sciences Interna-
    tional 48(3), 401–414 (2009)
14. Naidenova, X.A., Plaksin, M.V., Shagalov, V.L.: Inductive Inferring All Good Clas-
    sification Tests. In: Valkman, J. (ed.) ”Knowledge-Dialog-Solution”, Proceedings of
    International Conference. vol. 1, pp. 79 – 84. Kiev Institute of Applied Informatics,
    Kiev, Ukraine (1995)
15. Naidenova, X.: Machine Learning as a diagnostic task. In: Arefiev, I. (ed.)
    Knowledge-Dialog-Solution, Materials of the Short-Term Scientific Seminar, pp. 26
    – 36. State North-West Technical University Press, St Petersburg, Russia (1992)
16. Naidenova, X.: The Data-Knowledge Transformation. In: Solovyev, V. (ed.) Text
    Processing and Cognitive Technologies, vol. 3, pp. 130–151. Pushchino (1999)
17. Naidenova, X.: Good classification tests as formal concepts. In: Domenach, F.,
    Ignatov, D., Poelmans, J. (eds.) Formal Concept Analysis, Lecture Notes in Com-
    puter Science, vol. 7278, pp. 211–226. Springer Berlin Heidelberg, Leuven (2012)
18. Naidenova, X., Ermakov, A.: The decomposition of good diagnostic test inferring
    algorithms. In: Alty, J., Mikulich, L., Zakrevskij, A. (eds.) ”Computer-Aided De-
    sign of Discrete Devices” (CAD DD2001), Proceedings of the 4-th Inter. Conf.,
    vol. 3, pp. 61 – 68. Minsk (2001)
19. Naidenova, X.: An incremental learning algorithm for inferring logical rules from
    examples in the framework of the common reasoning process. In: Triantaphyllou,
    E., Felici, G. (eds.) Data Mining and Knowledge Discovery Approaches Based on
    Rule Induction Techniques, Massive Comp., vol. 6, pp. 89–147. Springer (2006)
20. Naidenova, X.A., Parkhomenko, V.: Attributive and object subcontexts in inferring
    good maximally redundant tests. In: Bertet, K., Rudolph, S. (eds.) Proceedings of
    the Eleventh International Conference on Concept Lattices and Their Applications,
    Košice, Slovakia, October 7-10, 2014. CEUR Workshop Proceedings, vol. 1252, pp.
    181–193. CEUR-WS.org (2014)
21. Ore, O.: Galois connections. Trans. Amer. Math. Soc 55, 494–513 (1944)
22. Rasiowa, H.: An Algebraic Approach to Non-classical Logics. Studies in logic and
    the foundations of mathematics, North-Holland Publishing Company (1974)
                Concept Lattices of RDF Graphs

                                       Jens Kötters


         Abstract. The concept lattice of an RDF graph is defined. The intents
         are described by graph patterns rather than sets of attributes, a view that
         is supported by the fact that RDF data is essentially a graph. A sim-
         ple formalization by triple graphs defines pattern closures as connected
         components of graph products. The patterns correspond to conjunctive
         queries, generalization of properties is supported. The relevance of the
         notion of connectedness is discussed, showing the impact of ontological
         considerations on the process of concept formation. Finally, definitions
         are given which allow the construction of pattern closures of concept
         graphs, which formalize Conceptual Graphs.

         Keywords: RDF Schema, Pattern Concepts, Conceptual Graphs, Nav-
         igation


  1    Introduction
  The Resource Description Framework (RDF) is an extensible standard for knowl-
  edge representation which relies on the use of simple sentences, each consisting
  of a subject, a predicate and an object. In RDF terminology, such a sentence
  is called a triple; a collection of triples is called an RDF graph, and the entities
  which occur as subjects, predicates or objects of triples (i.e., the things being
  talked about) are called resources. Figure 1 shows an RDF graph in the text-
  based Turtle notation [5] (namespace declarations are omitted). Figure 2 shows
  the same RDF graph, using a standard graphical notation (see e.g. [9]). Each
  triple is represented by an arc. In the abundance of such data, query languages,
  most notably SPARQL [1], can be used to gain access to the information con-
  tained. Querying alone is however of limited use to anyone who needs information
  but does not know specifically what to look for, or how to ask for it, or maybe
  whether such information can be found at all. In such cases, concept lattices can
  in principle guide the exploration of the data source, as they support interac-
  tive and data-aware query modification. In connection with SPARQL, this has
  already been demonstrated in [7]. The current paper is also motivated by data
  exploration, but deliberately limits the query language to conjunctive queries. In
  this case, the entire search space is obtained as a concept lattice in which each
  intent is described by a single most specific query (MSQ).
      Conjunctive queries can be represented as graphs, and entailment is described
  by graph homomorphisms [6]. Figure 3 shows a graph representing a conjunc-
  tive query – ”Someone who spoke about a sailor (in the subject)” – and its
  SPARQL translation below, which can always be obtained in a straightforward
  way. Graph queries are modeled after RDF graphs (Fig.2), so that solutions are



ex:Frank    ex:loves             ex:Mary .
ex:Frank    rdf:type             ex:Sailor .
ex:Frank    rdf:type             ex:Englishman .
ex:Tom      ex:loves             ex:Mary .
ex:Tom      rdf:type             ex:Gentleman .
ex:Tom      ex:dislikes          ex:Frank .
ex:Frank    ex:dislikes          ex:Tom .
ex:Tom      ex:name              "Tom" .
ex:Tom      ex:said              ex:S1 .
ex:S1       rdf:subj             ex:Tom .
ex:S1       rdf:pred             ex:name .
ex:S1       rdf:obj              "Tom" .
ex:Frank    ex:jested            ex:S2.
ex:S2       rdf:subj             ex:Frank .
ex:S2       rdf:pred             ex:likes .
ex:S2       rdf:obj              ex:Lobscouse .

ex:Liam     ex:loves             ex:Aideen .
ex:Aideen   ex:loves             ex:Liam .
ex:Rowan    ex:loves             ex:Aideen .
ex:Rowan    ex:hates             ex:Liam .
ex:Aideen   ex:said              ex:S3 .
ex:S3       rdf:subj             ex:Rowan .
ex:S3       rdf:pred             ex:likes .
ex:S3       rdf:obj              ex:Aideen .


                      Fig. 1. Sample RDF graph in Turtle notation



                  [node-and-arc drawing of the RDF graph listed in Fig. 1]

                       Fig. 2. Drawing of the RDF graph of Fig.1


described by homomorphisms, as well. But there are some differences to RDF
graphs. Firstly, all nodes (and all arcs) represent variables. The subject(s) of the
query are distinguished in some way (in Fig. 3, a black node represents the single
subject), whereas all other variables are implicitly understood as being existen-
tially quantified. Finally, the labels are attributes of a formal context K (for the
example, see Fig. 4), and thus represent formal concepts of its concept lattice
B(K). In practice, the formal context may encode background knowledge from
an RDF Schema file like the one in Fig. 5. The query vocabulary is restricted
to the attributes of K, the name Frank e.g. may not be used in a query for the
given example.


            [query graph: ?x -ex:said-> ?y -rdf:subj-> ?z -rdf:type-> ex:Sailor]

  SELECT ?x WHERE { ?x ex:said ?y. ?y rdf:subj ?z. ?z rdf:type ex:Sailor. }



       Fig. 3. Conjunctive query as a graph (above) and in SPARQL (below)



    For the rest of the paper, we shall refer to queries as patterns. Sect. 2 intro-
duces triple graphs as formalizations of queries without distinguished variables
(they correspond to SPARQL WHERE-clauses), and a query in one distinguished
variable is then represented by the pair (x, G) consisting of a triple graph G
and a vertex x of G. The case of queries in several variables is not treated in
this paper, but has been treated in [11] in connection with relational structures.
Section 3 shows how the search space for conjunctive queries over an RDF graph
is obtained as a concept lattice, where concept intents are patterns rather than
sets of attributes.
    Lattices of pattern concepts were introduced in [8], and although the theory
presented there does not make specific assumptions of what a pattern is, it seems
clear that graphs were meant to play a prominent role. In fact, the paper already
presents an example application where patterns are chemical compounds and
compares them by labeled graph homomorphisms. While chemical compounds
are described by connected graphs, their supremum in terms of these homo-
morphisms is not necessarily connected (ibid., pp. 139-140). Another suggestion
made in [8] is to use Conceptual Graphs (CGs) as patterns. If graph patterns
are used to describe situations (where several objects combine to a whole by
means of the way they are related to each other), we would probably expect
them to be connected. But if we formalize CGs using the same kind of labeled
graphs that was used for chemical compounds, the set of patterns is generally
not closed under suprema because, as we have seen, the supremum operation
does not preserve connectedness.




      [cross-table on the properties and classes name, type, loves, likes, hates,
      dislikes, jested, said, subj, pred, obj, Englishman, Gentleman, Man, Sailor,
      Person, "Tom", lobscouse and Resource; each row carries crosses under itself,
      its generalizations according to the RDFS ontology of Fig. 5, and Resource]

                 Fig. 4. Formal context K of properties and classes




 ex:Englishman   rdfs:subClassOf            ex:Man
 ex:Gentleman    rdfs:subClassOf            ex:Man
 ex:Sailor       rdfs:subClassOf            ex:Person
 ex:Man          rdfs:subClassOf            ex:Person
 ex:loves        rdfs:subPropertyOf         ex:likes
 ex:hates        rdfs:subPropertyOf         ex:dislikes
 ex:jested       rdfs:subPropertyOf         ex:said
 ex:loves        rdfs:domain                ex:Person
 ex:loves        rdfs:range                 ex:Person



                                 Fig. 5. An RDFS ontology


    The situation changes if graphs have a distinguished element. The supremum
is a graph product, again with a distinguished element, and although the product
of connected graphs is generally not connected, it makes sense to identify the
supremum with the connected component which holds the distinguished variable.
The relevance of connectedness is discussed in Sect. 4. In Sect. 5, the approach
is applied to concept graphs.

2    Triple Graphs
A triple graph over K is a triple (V, T, κ), where V is a set of vertices, T ⊆
V × V × V and κ : V → B(K).
   A morphism ϕ : (V1 , T1 , κ1 ) → (V2 , T2 , κ2 ) of triple graphs is a map ϕ : V1 →
V2 with
                         (x, y, z) ∈ T1 ⇒ (ϕx, ϕy, ϕz) ∈ T2 ,                      (1)
                                             κ1 (x) ≥ κ2 (ϕx)                      (2)
for all x, y, z ∈ V1 .
    The product of a family of triple graphs Gi = (Vi , Ti , κi ), i ∈ I, is the triple
graph

            ∏_{i∈I} G_i := ( ×_{i∈I} V_i , { t^T | t ∈ ×_{i∈I} T_i } , ⋁_{i∈I} κ_i ),              (3)

where t^T := ((t_{i0})_{i∈I} , (t_{i1})_{i∈I} , (t_{i2})_{i∈I} ) and (⋁_{i∈I} κ_i )(v) := ⋁_{i∈I} κ_i (v_i ).
    Finally, an RDF graph with the set T of triples is represented by the triple
graph (R, T, γ*), where R is the set of resources (properties and classes are
duplicated so that they appear in only one triple) and

                      γ*(g) := (g″, g′)          if g ∈ G,
                      γ*(g) := ⊤ := (G, G′)      otherwise.                          (4)
   A pattern is a pair (v, H), where H =: (V, T, κ) is a triple graph and v ∈ V .
A pattern morphism ϕ : (v1 , H1 ) → (v2 , H2 ) is a morphism ϕ : H1 → H2 with
ϕ(v1 ) = v2 . A solution is a pattern in the data graph. Pattern morphisms can
thus describe solutions as well as entailment.
   The product of a family of patterns is given by

                      ∏_{i∈I} (v_i , H_i ) := ( (v_i )_{i∈I} , ∏_{i∈I} H_i ).                       (5)
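For an operational reading of (3) and (5), the following sketch builds the binary
product of two triple graphs and of two patterns. It is a minimal illustration in
Python; the data structures are our own illustrative choices, and sup is assumed
to compute the supremum of two labels in B(K). Extracting the connected component
of the distinguished vertex and folding it to a minimal equivalent pattern (Sect. 3)
are left out.

from itertools import product as cartesian

# A triple graph is modeled as (V, T, kappa): V a set of vertices, T a set of
# (subject, predicate, object) triples over V, and kappa a dict mapping each
# vertex to its label in B(K).

def triple_graph_product(g1, g2, sup):
    V1, T1, k1 = g1
    V2, T2, k2 = g2
    V = set(cartesian(V1, V2))                     # pairs of vertices
    T = {((s1, s2), (p1, p2), (o1, o2))            # componentwise triples, cf. t^T in (3)
         for (s1, p1, o1) in T1
         for (s2, p2, o2) in T2}
    kappa = {(v1, v2): sup(k1[v1], k2[v2]) for (v1, v2) in V}
    return V, T, kappa

def pattern_product(p1, p2, sup):
    (v1, H1), (v2, H2) = p1, p2
    return (v1, v2), triple_graph_product(H1, H2, sup)   # cf. (5)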

    The definitions above follow a category theoretical approach. Queries (minus
distinguished variable(s)) and the data source have representations in the same
category (using ∆ for the data source), and morphisms λ : G → ∆ can be under-
stood as solutions of the respective query. Query entailment is then described
by morphisms, and if the category has products, then so has the category of
structured arrows (these are the pairs (ν, G), cf. [2]), which model queries with
distinguished variables. The pattern closures (MSQs) arise as products from the
set of possible solutions (λ, ∆). For convenience, we write (x, G) instead of (ν, G)
if the domain of ν is a one-element set (x being the image of that element).


3    Example RDF Graph and Concept Lattice
In this section, we will walk through the process of generating a concept lattice
for the RDF graph in Fig. 1. To keep the example small (and perhaps meaning-
ful), only person concepts will be created, i.e. concepts with extents in the set
{Tom, Frank, Mary, Rowan, Liam, Aideen}.
    As a first step, we determine the object intent for each person, i.e. its most
complete description. For each person, its connected component in Fig. 2 dou-
bles as a most complete description if we reinterpret the nodes and arcs as
(distinct) variables, reinterpret the labels beside them as concept names (rather
than resource names), and mark the variable corresponding to that person as
distinguished. Let G1 and G2 denote, top to bottom, the components in Fig. 2
after reinterpretation, and let T , F , M , R, L and A denote the variables which re-
place Tom, Frank, Mary, Rowan, Liam and Aideen. Then the pairs (T, G1 ), (F, G1 ),
(M, G1 ), (R, G2 ), (L, G2 ) and (A, G2 ) formalize the object intents. These pairs
are considered patterns; each pattern consists of a distinguished variable and a
triple graph (formalized in Sect. 2).
    The person concepts are obtained from graph products of the six base pat-
terns. Many concept intents have the same underlying triple graph (consider e.g.
the base patterns), so there would be a lot of redundancy in the concept lattice
if we were to draw intents next to each concept. A different kind of diagram
avoids this redundancy while still showing all concepts and their intents: Fig. 6
shows all triple graphs which occur in the intents of person concepts, ordered by
homomorphism. Choosing any of the nodes as a distinguished variable gives rise
to a pattern intent, and this way concepts can be identified with triple graph
nodes. The gray nodes in Fig. 6 are the person concepts (the white nodes are
non-person concepts which occur in the descriptions of person concepts). The
concept extents for the person concepts are drawn next to the gray nodes. The
person concept lattice itself can be easily obtained from the diagram in Fig. 6
by connecting the gray nodes according to their extents.
    Note however that some concepts have multiple representations in Fig. 6.
These may occur in the same triple graph, which indicates symmetry (i.e., a
triple graph automorphism). An example is the pair of lovers on the right side
of Fig. 6. The other case is where the same concept occurs in different triple
graphs. An example would be the Mary concept, which occurs in the lower left
triple graph as well as in the triple graph above it on the left side of Fig. 6. In
such cases, there is always a most specific triple graph containing that concept; it
describes that concept’s pattern intent. Other triple graphs containing the same
concept can be folded onto the most specific triple graph (which is contained as
a subgraph). But the fact that triple graphs may reappear as subgraphs of other
triple graphs translates into another redundancy problem when drawing the
hierarchy of triple graph product components. In Fig. 6, this kind of redundancy
has been avoided by cutting those subgraphs out and indicating the glue nodes
(for the gluing operation which inverts the cutting operation) with an asterisk.
The glue nodes are only marked in the upper triple graph, not in the lower
triple graph, because the lower triple graph can be considered self-contained (a
glue node’s counterpart in the lower graph can be identified by its extent). To
put this differently: in every triple graph, the asterisk concepts are part of the
description of the non-asterisk concepts, but not vice versa.

                          Fig. 6. Concepts of the RDF graph
    The extent of a concept reveals how its pattern intent can be obtained.
Consider for example the concept with extension {R, F, T }, which is part of the
top left triple graph. The extension tells us that the pattern intent is the (core
graph of the) component of the pattern product (R, G2 )×(F, G1 )×(T, G1 ), which
contains the vertex (R, F, T ), and (R, F, T ) becomes the designated variable. The
diagram also shows that {R, T } is a generating set for this extent, because it
is not contained in a concept extent of the triple graph’s lower neighbors, and
so (R, G2 ) × (T, G1 ) already produces the same pattern. As was described in
[8], Ganter’s NextConcept algorithm can be used to systematically generate all
extents (and thus all concepts), computing pattern products in the process. It is
advisable to use graph folding to obtain minimal equivalent patterns, for output
as well as for intermediate computational steps.
    Product patterns could also be computed directly from the triples in Turtle
notation (Fig. 1), by combining triples with each other. Informally, we could
write triple graphs as sets of triples (x : κ(x), p : κ(p), y : κ(y)) (borrowing from
RDF and CG notation), and compute the product triples by taking suprema
componentwise, like so:

                           (F : ⊤, p : type, y1 : Englishman)
                        ∨ (T : ⊤, p : type, y2 : Gentleman)
                        = ((F, T) : ⊤, (p, p) : type, (y1, y2) : Man).

However, one would still have to identify and minimize connected components.
    Note that the property and class hierarchies in our example are trees. So the
attribute concepts of K are closed under suprema, and each concept label can
be expressed by a single attribute. The suprema of properties may be of type
"Resource". Such arcs are removed from the patterns; it can be shown (using
the product property) that the resulting patterns can still be considered MSQs.


4    Connectedness

We have identified two components of the RDF graph in Fig. 1, and this cor-
responds to our understanding that objects are "connected" if they contribute
to the same instance of a situation. The notion of connectedness deserves closer
examination, however. Let us first observe that the notion of strong connectivity
is not the right one, because the second-to-top concepts in Fig. 6 (we might
name them "Lover" and "Beloved") would not have been created under this
notion. But we also cannot in general rely on weak connectivity alone. If it had
been stated explicitly that all persons are of type "Person", all persons would
be connected through that class node.
    Clearly, objects do not contribute to the same situation just on the basis of
belonging to the same class, or having equal property values, like being of the
same age. So connectedness of patterns should not rely on properties, classes
or values. In this paper, no further requirements have been made, but if Liam
liked lobscouse (a traditional sailors’ dish), this would not have been sufficient
because all objects would be connected through the lobscouse node. From a
machine perspective, there is no indication in the RDF data (or the schema)
that "ex:lobscouse" is of a different quality than "ex:Tom" or "ex:Mary". A system
that generates patterns from automatically acquired data without proper
preprocessing would possibly generate a large number of meaningless (and un-
necessarily large) concepts because of such insubstantial connections, unless it
can be guaranteed that some standard for making such distinctions is applied
on the level of the ontology language.
    Further questions may include whether and how real-world objects should be
connected through objects of spoken language; on what basis we could allow a
pattern like the one describing the set of all wives older than their husbands,
while at the same time keeping meaningless comparisons of values from having an
impact on concept formation; or whether pattern connection should be described in
terms of edges rather than nodes, and whether we can classify or characterize the kind
of relations which contribute to pattern formation. It seems that, when dealing
with pattern concepts, questions of philosophical ontology may have (at least
subjectively) measurable impact through the notion of connectedness.


5    Related Work
In Sect. 2, the category theoretical framework for lattices of pattern concepts has
been applied to triple graphs, and we will now show how it is applied to simple
concept graphs. In [15], a simple concept graph over a power context family
K⃗ = (K0 , K1 , K2 , . . . ) is defined as a 5-tuple (V, E, ν, κ, ρ). We may interpret
the 4-tuple (V, E, ν, κ) as a query, K⃗ as a data source and the map ρ as a set
of solutions. The map κ assigns concepts of the contexts Ki to the vertices and
edges in V and E, respectively. The definition of queries should be independent
of any particular data source, so it makes sense to interpret K⃗ as a power
context family describing schema information (background knowledge) instead;
the contexts could describe attribute implications, like the one in Fig. 4. The
map ρ will however not be part of the query. For the rest of the exposition, we
shall call the 4-tuple (V, E, ν, κ) an abstract concept graph over K⃗, after a term
used in the first paper on concept graphs [14, p. 300].
    In [15], the triple (V, E, ν) is called a relational graph. The morphisms in
the category of abstract concept graphs over K⃗ are relational graph morphisms
(this replaces condition (1) for triple graphs) satisfying (2). The product is de-
fined in analogy to (3) (cartesian products of edges have to be taken individually
for each arity k).
    A datasource for the schema K⃗ is a power context family D⃗ = (D0 , D1 , D2 , . . . )
for which the attribute implications described by Ki =: (Hi , Ni , Ji ) are satisfied
in Di =: (Gi , Mi , Ii ), i = 0, 1, 2, . . . , and Ni ⊆ Mi holds. We can represent D⃗
by an abstract concept graph D = (G0 , ⋃_{i≥1} Gi , id, γD ) with
γD (u) := ((u^{Ik} ∩ Nk )^{Jk} , u^{Ik} ∩ Nk ) ∈ B(Kk ) for u ∈ Gk . Let us furthermore define
ExtD (κ(u)) := Ext(κ(u))^{Jk Ik} for u ∈ V ∪ E. If we define a realization of G =: (V, E, κ, ν)
over D as a map ρ with ρ(u) ∈ Ext∆ (κ(u)) for all u ∈ V ∪ E and ρ(e) = (ρ(v1 ), . . . , ρ(vk )),
we obtain that the realizations ρ of G over D are precisely the morphisms ρ :
G → D. This means that product patterns for abstract concept graphs can be
interpreted as MSQs, as was mentioned at the end of Sect. 1.
    Triple graphs are now obtained as a special case of concept graphs: We define
a power context family D⃗ where D0 is a supercontext of K0 which in addition
contains all objects (having only the "Resource" attribute), and where D3 =
(T, ∅, ∅) represents the set of triples.
    In Relational Concept Analysis, concept lattices for different kinds of inter-
related objects are generated by an iterative algorithm and, as in Fig. 6, illus-
trations displaying graphs of concepts have been given [4, 10]. Computational
aspects of generating pattern structures are covered e.g. in [13, 12]. In [3], con-
cept lattices are generated from Conceptual Graphs, but the approach seems to
be tailored towards a more specific case where edges (and thus paths) express
dependencies. A comparison with Relational Semantic Systems [16] still has to
be made.


6    Conclusion

In [11], a lattice of closures of database queries, which links products of query
patterns to the join operation on result tables, has been introduced. The queries
and database were formalized by relational structures. The fact that the method
could be applied again, first to RDF graphs and then to abstract concept graphs,
suggests that the underlying category theoretical notions provide a recipe that
could be applied to still different formalizations of queries, allowing in every case
the mathematical description of lattices of most specific queries. Combinatorial
explosion is a computational problem that quickly turns up when computing the
concept lattices. From a practical perspective, the lattices are useless if complex-
ity problems cannot be solved. However, in order to support data exploration,
patterns must be understandable, thus also limited in size, and a solution to this
problem may entail a solution to the problem of computational complexity.
    In the section on connectedness, we have seen that ontological considerations
stand in a direct relation to the quality of computed concepts. As a first con-
sequence, although the idea of treating all kinds of resources in a homogeneous
way seemed appealing, triple graphs must be replaced by some other formal-
ization which reflects the importance of the distinction between instances and
classes/properties. While such questions naturally arise in the setting of pattern
concepts, they seem to have no obvious analogy for standard FCA, where intents
are sets of attributes. Pattern concepts thus have the potential to further and
substantiate a perspective on FCA as a branch of mathematical modeling, where
the entities to be modeled are not "real world" systems but rather systems of
concepts. Future work may be concerned with extensions of RDF/RDFS which
support this perspective.


References
 1. SPARQL 1.1 Query Language. Tech. rep., W3C (2013), http://www.w3.org/TR/
    sparql11-query
 2. Adámek, J., Herrlich, H., Strecker, G.E.: Abstract and concrete categories : the
    joy of cats. Pure and applied mathematics, Wiley, New York (1990)
 3. Andrews, S., Polovina, S.: A mapping from conceptual graphs to formal concept
    analysis. In: Andrews, S., Polovina, S., Hill, R., Akhgar, B. (eds.) Proceedings of
    ICCS 2011. LNCS, vol. 6828, pp. 63 – 76. Springer (2011)
 4. Azmeh, Z., Huchard, M., Napoli, A., Hacene, M.R., Valtchev, P.: Querying rela-
    tional concept lattices. In: Proc. of the 8th Intl. Conf. on Concept Lattices and
    their Applications (CLA’11). pp. 377–392 (2011)
 5. Beckett, D., Berners-Lee, T., Prud’hommeaux, E., Carothers, G.: RDF 1.1 Turtle.
    Tech. rep., W3C (2014), http://www.w3.org/TR/turtle
 6. Chandra, A.K., Merlin, P.M.: Optimal implementation of conjunctive queries in
    relational databases. In: Proceedings of the ninth annual ACM symposium on
    Theory of computing. pp. 77–90. STOC ’77, ACM, New York, NY, USA (1977)
 7. Ferré, S.: Conceptual navigation in RDF graphs with SPARQL-like queries. In:
    Kwuida, L., Sertkaya, B. (eds.) Proceedings of ICFCA 2010. LNCS, vol. 5986, pp.
    193–208. Springer, Heidelberg (2010)
 8. Ganter, B., Kuznetsov, S.O.: Pattern structures and their projections. In: Delu-
    gach, H.S., Stumme, G. (eds.) Proceedings of ICCS 2001. LNCS, vol. 2120, pp.
    129–142. Springer (2001)
 9. Hitzler, P., Krötzsch, M., Rudolph, S.: Foundations of Semantic Web Technologies.
    Chapman & Hall/CRC (2009)
10. Huchard, M., Hacene, M.R., Roume, C., Valtchev, P.: Relational concept discovery
    in structured datasets. Annals of Mathematics and Artificial Intelligence 49(1-4),
    39–76 (2007)
11. Kötters, J.: Concept lattices of a relational structure. In: Pfeiffer, H.D., Ignatov,
    D.I., Poelmans, J., Gadiraju, N. (eds.) Proceedings of ICCS 2013. LNCS, vol. 7735,
    pp. 301–310. Springer (2013)
12. Kuznetsov, S.O.: Computing graph-based lattices from smallest projections. In:
    Wolff, K.E., Palchunov, D.E., Zagoruiko, N.G., Andelfinger, U. (eds.) KONT/KPP.
    Lecture Notes in Computer Science, vol. 6581, pp. 35–47. Springer (2007)
13. Kuznetsov, S.O.: Fitting pattern structures to knowledge discovery in big data. In:
    Formal Concept Analysis, 11th International Conference, ICFCA 2013, Dresden,
    Germany, May 21-24, 2013. Proceedings. pp. 254–266 (2013), http://dx.doi.org/
    10.1007/978-3-642-38317-5_17
14. Wille, R.: Conceptual graphs and formal concept analysis. In: Lukose, D., Delu-
    gach, H.S., Keeler, M., Searle, L., Sowa, J.F. (eds.) Proceedings of ICCS 1997, 5th
    International Conference on Conceptual Structures. LNCS, vol. 1257, pp. 290–303.
    Springer, Heidelberg (1997)
15. Wille, R.: Formal concept analysis and contextual logic. In: Hitzler, P., Schärfe,
    H. (eds.) Conceptual Structures in Practice, pp. 137–173. Chapman & Hall/CRC
    (2009)
16. Wolff, K.E.: Relational scaling in relational semantic systems. In: Rudolph, S.,
    Dau, F., Kuznetsov, S.O. (eds.) Proceedings of ICCS 2009. LNCS, vol. 5662, pp.
    307–320. Springer (2009)
       Variability representation in product lines using
            concept lattices: feasibility study with
            descriptions from Wikipedia’s product
                     comparison matrices

                 Jessie Carbonnel, Marianne Huchard, and Alain Gutierrez

                      LIRMM, CNRS & Université de Montpellier, France
                              jessiecarbonnel@gmail.com
                              surname.lastname@lirmm.fr



             Abstract. Several formalisms can be used to express variability in a
             product line. Product comparison matrices are a common and simple way
             to display the variability of existing products from the same family, but they
             lack formalisation. In this paper, we focus on concept lattices, another
             alternative already explored in several works to express variability. We
             first propose a method to translate a description from existing product
             comparison matrices into a concept lattice using Formal Concept Analy-
             sis. Then, we propose an approach to represent the case where a product
             family is described by other product families with interconnected lat-
             tices using Relational Concept Analysis. Because of the combinatorial
             aspect of these approaches, we evaluate the scalability of the produced
             structures. We show that a particular structure (AOC-poset) possesses
             interesting properties for the studies that we envision.

             Keywords: Product lines, Formal Concept Analysis, Variability Rep-
             resentation.


      1    Introduction
      In product line engineering [5], several formalisms can be used to depict vari-
       ability. Variability representation usually requires taking into account a large
       amount of data, and it is important to provide tools that help designers represent
       and exploit these data.
           Among existing formalisms, the most common are Product Comparison Matri-
       ces (PCMs). PCMs describe product properties in a tabular form. They are a simple
       way to display features of products from the same family and to compare them.
       However, there is no established format or good practice for designing these PCMs.
       Therefore, cells in PCMs lack formalisation and it is difficult to perform auto-
       matic and efficient processing or analysis on them [4, 17]. Feature Models (FMs)
      constitute an alternative to PCMs. FMs describe a set of existing features and
      constraints between them, and thus represent all possible configurations of prod-
      ucts from a same family [6, 9, 11, 12]. They depict variability in a more formal




      way than PCMs but, in their standard form, FMs focus exclusively on features
      and do not specify whether an existing product is associated with a configuration.

         Formal Concept Analysis (FCA) and concept lattices have already been stud-
     ied to express variability [13]. From a set of objects described by attributes, FCA
     computes subsets of objects that have common attributes and structures these
      subsets in a hierarchical way in concept lattices. Concept lattices can represent
      information related to variability in a more formal way than PCMs. Through
      their structure, concept lattices highlight constraints between attributes like
      FMs do, while keeping the relation between existing products of the family and
      the possible configurations. Besides, contrary to FMs, which can take many
      different forms depending on design choices, concept lattices represent variability
      in a canonical form. Moreover, FCA can be extended by Relational Concept Anal-
      ysis (RCA), which makes it possible to take into account relationships between
      objects of separate lattices and provides a set of interconnected lattices. This is useful to
     classify sets of products from different categories.

          The formal and structural aspects of concept lattices make them good candidates
      for automatic or manual processing, including product comparison, search
      by attributes, partial visualisation around points of interest, or decision support.
      But the exponential growth of the size of concept lattices can make them difficult
      to exploit. In this paper, we will study the dimensions of these structures and
      try to find out whether depicting product line variability with concept lattices provides
      structures that can be exploited from the perspective of size. For this, we will first build
      concept lattices from existing descriptions. Because they are
      abundant and focus on both products and features, we will extract descriptions
     from PCMs. Besides, we can find PCMs that possess a feature whose value
     domain corresponds to a set of products described by another PCM. It is a
     special case where a product family is described by another product family.
     In a second phase, we will model this case by interconnecting concept lattices
     obtained from the descriptions of these two PCMs using Relational Concept
     Analysis.

        In this paper, we want to answer these questions: How can we represent the
     variability expressed by a PCM with a concept lattice? To what extent can we
     model the case where a product family is described by another product family
     with RCA? Can we efficiently exploit structures obtained with FCA and RCA
     with regard to their size?

         The remainder of this paper is organised as follows. In Section 2, we will re-
     view Formal Concept Analysis and propose a way to build a concept lattice from
     a PCM’s description. In Section 3, we will study how to represent the case where
     a product family is described by another product family with Relational Concept
     Analysis. Then, in Section 4, we will apply these two methods on Wikipedia’s
     PCMs to get an order of magnitude of the obtained structure size. Section 5
     discusses related work. Section 6 presents conclusion and future work.

2     Formal Concept Analysis and product comparison
      matrices

This section proposes an approach to represent with a concept lattice the vari-
ability originally expressed in a PCM.


2.1   Concept lattices and AOC-posets

Formal Concept Analysis [8] is a mathematical framework which structures a set
of objects described by attributes and highlights the differences and similarities
between these objects. FCA extracts from a formal context a set of concepts
that forms a partial order provided with a lattice structure: the concept lattice.
    A formal context is a 3-tuple (O, A, R) where O and A are two sets and
R ⊆ O × A a binary relation. Elements from O are called objects and elements
from A are called attributes. A pair from R states that the object o possesses
the attribute a. Given a formal context K = (O, A, R), a concept is a tuple
(E, I) such that E ⊆ O and I ⊆ A. It depicts a maximal set of objects that
share a maximal set of common attributes. E = {o ∈ O|∀a ∈ I, (o, a) ∈ R}
is the concept’s extent and I = {a ∈ A|∀o ∈ E, (o, a) ∈ R} is the concept’s
intent. Given a formal context K = (O, A, R) and two concepts C1 = (E1 , I1 )
and C2 = (E2 , I2 ) from K, C1 ≤ C2 if and only if E1 ⊆ E2 and I2 ⊆ I1 . C1 is
a subconcept of C2 and C2 is a superconcept of C1 . When we provide all the
concepts from K with the specialisation order ≤, we obtain a lattice structure
called a concept lattice.
    We represent the intent and extent of a concept in an optimised way by making el-
ements appear only in the concept where they are introduced. Figure 7 represents
a concept lattice with simplified intents and extents. We call object-concepts and
attribute-concepts the concepts which introduce at least an object or
an attribute, respectively. In Figure 7, Concept 7 and Concept 2 introduce neither attributes
nor objects: their simplified intent and extent are empty. If they are not nec-
essary, we can choose to ignore these concepts. The Attribute-Object-Concept poset
(AOC-poset) of a formal context K is the suborder of (C_K , ≤) restricted to
object-concepts and attribute-concepts. Figure 4 presents the AOC-poset match-
ing the concept lattice from Figure 7. In our case, interesting properties of con-
cept lattices with regard to variability are preserved in AOC-posets: this smaller
structure can be used as an alternative to concept lattices.
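As a minimal, purely illustrative sketch of these definitions (not the tool chain
used later in this paper), the following Python code enumerates the concepts of
a small formal context by closing every attribute subset, and then keeps only the
object-concepts and attribute-concepts that form the AOC-poset; the enumeration
is exponential and only meant for toy contexts such as the one in Figure 3.

from itertools import combinations

def concepts(objects, attributes, incidence):
    """Enumerate all formal concepts of (O, A, R) naively (exponential in |A|)."""
    attributes = list(attributes)
    def extent_of(intent):
        return frozenset(o for o in objects
                         if all((o, a) in incidence for a in intent))
    def intent_of(extent):
        return frozenset(a for a in attributes
                         if all((o, a) in incidence for o in extent))
    found = set()
    for r in range(len(attributes) + 1):
        for subset in combinations(attributes, r):
            e = extent_of(subset)
            found.add((e, intent_of(e)))      # closing any attribute set yields a concept
    return found

def aoc_poset(all_concepts, objects, attributes, incidence):
    """Keep only the concepts introducing at least one object or one attribute."""
    def object_intent(o):
        return frozenset(a for a in attributes if (o, a) in incidence)
    def attribute_extent(a):
        return frozenset(o for o in objects if (o, a) in incidence)
    return {(e, i) for (e, i) in all_concepts
            if any(object_intent(o) == i for o in e)
            or any(attribute_extent(a) == e for a in i)}

On such a context, the second function discards exactly the concepts which, like
Concept 2 and Concept 7 in Figure 7, introduce neither objects nor attributes.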


2.2   Product Comparison Matrix

A PCM describes a set of products from the same family with variable features in
a tabular way. Figure 1 presents a PCM which describes four products against
two features, taken from Wikipedia.
    We notice that the cells of this PCM lack formalisation: in this PCM, different
values have the same meaning (Object-Oriented and OO) and there seem to be
no rules on the use of value separators (’,’, ’&’ or ’and’).

      Language   Standardized                        Paradigm
      Java           Yes                    Functional, Imperative, OO
      Perl           No                Functional & Procedural & Imperative
      Php            No       Functional, Procedural, Imperative and Object-Oriented
      C#             Yes        functional, procedural, imperative, Object Oriented

     Fig. 1. Excerpt of a Product Comparison Matrix on programming languages
     (http://en.wikipedia.org/wiki/Comparison of programming languages, July 2014)


     2.3   Processing PCMs to get formal context

     If we want to automatically build concept lattices from this kind of descriptions,
     we need to make the cell values respect a format in order to extract and process
      them. In [17], the authors identify eight different types of cell values:

      – yes/no values that indicate if the criterion is satisfied or not,
      – constrained yes/no values when the criterion is satisfied under conditions,
      – single-value when the criterion is satisfied using this value,
      – multi-values when several values can satisfy the criterion,
      – unknown value when we do not know if the criterion is satisfied,
      – empty cell,
      – inconsistent value when the value is not related to the criterion,
       – extra information when the cell value offers additional information.

          We clean each type of cell as follows. Empty cells are not a problem, even
      though they indicate a lack of information. Inconsistent values should be de-
      tected, then clarified or erased. Unknown values and extra information will be
      simply erased. Other types of cells can have either one single value or a list
      of values. We will always use a comma as value separator. Values with the same
      meaning will be written in the same way. Once we have applied these rules to
      a PCM, we consider that this PCM is cleaned. A cleaned PCM can contain three
     types of features:

      – simple boolean,
      – constrained boolean,
      – non-boolean.

          Since automatic processing is difficult on original PCMs, we clean them manu-
     ally. When we clean the PCM in Figure 1, we obtain the PCM in Figure 2.
         We can now automatically extract values from cleaned PCMs to generate
     formal contexts. Given a set of objects and a set of binary attributes, a formal
     context is a binary relation that states which attributes are possessed by each
     object. In summary, we want to convert a set of multivalued features (PCM)
     into a set of binary attributes (formal context).
          The scaling technique [8] consists in creating a binary attribute for each value
      (or group of values) of a multivalued feature. Boolean features can produce
      a single attribute to indicate whether or not the object owns this feature. Yet,

   Language   Standardized                      Paradigm
   Java           Yes            Functional, Imperative, Object-Oriented
   Perl           No               Functional, Procedural, Imperative
   Php            No       Functional, Procedural, Imperative, Object-Oriented
   C#             Yes      Functional, Procedural, Imperative, Object-Oriented

                    Fig. 2. PCM in Figure 1 cleaned manually


because of empty cells, the fact that an object does not possess an attribute could
also mean that the cell from the PCM was left blank. To be more precise, we
can choose to generate two attributes: one to indicate that the object possesses
the feature, and one to indicate that the object does not possess the feature.
Constrained boolean features can be processed in the same way as simple
boolean features by producing one or two attributes, with the difference that we
can keep constraints in the form of attributes. Non-boolean features will simply
produce an attribute per value or group of values. We applied the scaling technique
to the cleaned PCM of Figure 2 and got the formal context in Figure 3.

  Language  Standardized:Yes  Standardized:No  Functional  Procedural  Imperative  Object-Oriented
  Java             x                               x                        x             x
  Perl                               x             x            x           x
  Php                                x             x            x           x             x
  C#               x                               x            x           x             x

      Fig. 3. Formal context obtained after scaling the cleaned PCM in Figure 2
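A possible sketch of this scaling step is shown below (Python; the input structure
for a cleaned PCM and the full-context treatment of boolean features are our own
illustrative choices, not a prescribed format).

def scale_pcm(pcm, boolean_features):
    """Turn a cleaned PCM into a formal context, as a set of (object, attribute) pairs.

    pcm: dict mapping each product to a dict {feature: value or list of values}.
    boolean_features: features with 'Yes'/'No' values; in the full-context variant
    used here they yield the attributes 'feature:Yes' and 'feature:No'.
    """
    incidence = set()
    for product, features in pcm.items():
        for feature, value in features.items():
            if feature in boolean_features:
                incidence.add((product, f"{feature}:{value}"))
            else:
                values = value if isinstance(value, list) else [value]
                for v in values:                  # one binary attribute per value
                    incidence.add((product, v))
    return incidence

# The cleaned PCM of Figure 2, in the assumed input structure:
pcm = {
    "Java": {"Standardized": "Yes",
             "Paradigm": ["Functional", "Imperative", "Object-Oriented"]},
    "Perl": {"Standardized": "No",
             "Paradigm": ["Functional", "Procedural", "Imperative"]},
    "Php":  {"Standardized": "No",
             "Paradigm": ["Functional", "Procedural", "Imperative", "Object-Oriented"]},
    "C#":   {"Standardized": "Yes",
             "Paradigm": ["Functional", "Procedural", "Imperative", "Object-Oriented"]},
}
context = scale_pcm(pcm, boolean_features={"Standardized"})

Applied to the cleaned PCM of Figure 2, this produces the incidence of Figure 3.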


    The structure of concept lattices and AOC-posets makes it possible to highlight inter-
esting properties from a variability point of view: they classify objects depending
on their attributes and emphasise relations between these attributes (e.g. require,
exclude). For instance: attributes introduced in the top concept are owned by
all objects; attributes which are introduced in the same concept always appear
together; if an object o1 is introduced in a sub-concept of a concept introducing
an object o2 , o1 possesses all the attributes of o2 and other attributes; two ob-
jects introduced in the same concept possess the same attributes. Feature models
show part of this information, mainly reduced to relations between attributes,
as they do not include the products in the representation.
    Figure 7 represents the concept lattice from the formal context of Figure 3.
Figure 4 represents the AOC-poset from the formal context of Figure 3. In these
two structures, we can see that: all languages allow writing programs according
     to functional and imperative paradigms; the product Php has all the attributes
      of Perl in addition to the attribute Object-Oriented; all standardised languages
     are object-oriented.




                   Fig. 4. AOC-poset from the formal context of Figure 3




     3     Relational Concept Analysis and interconnected
           product lattices

     This section proposes an approach to model the case where a PCM possesses
     a feature whose value domain corresponds to a set of products described by
     another PCM.


     3.1    Modeling interconnected families of products with RCA

     We illustrate the modeling of interconnected families of products with an exten-
     sion of our example. We can find on Wikipedia a PCM on Wikisoftwares that
     refers to Programming Languages: we want to structure wikisoftwares accord-
      ing to the programming languages in which they are written. We assume a PCM
      about wikisoftwares that has a boolean feature Open Source and a constrained

boolean feature Spam Prevention. We applied the approach of Section 2 and obtained
the formal (objects-attributes) context in Figure 5. Figure 8 presents the concept
lattice associated with the context in Figure 5.


        Wikisoftware OS:Yes OS:No SP:Yes SP:No SP:Captcha SP:Blacklist
        TeamPage              x     x              x
        Socialtext                  x
        MindTouch             x     x
        DokuWiki       x            x                          x
        EditMe                x     x              x

                 Fig. 5. Objects-attributes context of Wikisoftwares


    Relational Concept Analysis (RCA) [10] extends FCA to take into account
relations between several sets of objects. Each set of objects is defined by its
own attributes (in an objects-attributes context) and can be linked with other
sets of objects. A relation between two sets of objects is stated by a relational
context (objects-objects context). A relational context is a 3-tuple (O1 , O2 , I)
where O1 (source) and O2 (target) are two sets of objects such that there are
two formal contexts (O1 , A1 , R1 ) and (O2 , A2 , R2 ), and where I ⊆ O1 × O2 is a
binary relation.
    We want to express the relation isWritten between objects of Wikisoftwares
and objects of Programming languages. TeamPage and EditMe are written in
Java, SocialText in Perl, DokuWiki in Php and Mindtouch in both Php and
C#. We link each wikisoftware with corresponding programming languages in
an objects-objects context, and we present it in Figure 6.


                           isWritten Java Perl Php C#
                           TeamPage   x
                           Socialtext      x
                           MindTouch            x x
                           DokuWiki             x
                           EditMe     x

Fig. 6. Objects-objects context expressing the relation between objects of Wikisoft-
wares and Programming languages




3.2   Processing interconnected families of products

Given an objects-objects context Rj = (Ok , Ol , Ij ), there are different ways for
an object from Ok domain to be in relation with a concept from Ol . For instance:
an object is linked (by Ij ) to at least one object of a concept’s extent (existential




      Fig. 7. Concept lattice of Programming languages, step 0 (built with RCAExplore:
      http://dolques.free.fr/rcaexplore.php)



      scaling); an object is linked (by Ij ) only to objects of a concept’s extent (universal
      scaling). For each relation of R, we specify which scaling operator is used.
          In RCA, objects-attributes contexts are extended according to objects-objects
      contexts to take into account relations between objects of different sets. For each
      objects-objects context Rj = (Ok , Ol , Ij ), RCA extends the objects-attributes
      context of the set of objects Ok by adding relational attributes according to
      concepts of the lattice associated with the objects-attributes Ol . Each concept
      c from Ol gives a relational attribute q r :c where q is a scaling operator and
      r is the relation between Ok and Ol . A relational attribute appears in a lattice
      just like the other attributes, with the difference that it can be considered as a
      reference to a concept from another lattice.
          As shown in the example, data are represented in a Relational Context Fam-
      ily (RCF), which is a tuple (K, R) such that K is a set of objects-attributes
      contexts Ki = (Oi , Ai , Ii ), 1 ≤ i ≤ n and R is a set of objects-objects contexts




                  Fig. 8. Concept lattice of Wikisoftwares, step 0



Rj = (Ok , Ol , Ij ), 1 ≤ j ≤ m, with Ij ⊆ Ok × Ol . Given an objects-attributes
context K = (O, A, I), we define rel(K) the set of relations (objects-objects con-
texts) of R which have O as domain, and ρ a function which associates a scaling
operator to each objects-objects context of R. For each step, we extend the con-
text K by adding relational attributes from each context of rel(K): we obtain the
complete relational extension of K. When we compute the complete relational
extension of each context of K, we obtain the complete relational extension of
the RCF.
    RCA generates a succession of contexts and lattices associated with the RCF
(K, R) and ρ. In step 0, RCA generates the lattices associated with the contexts of K,
with K^0 = K. In step e + 1, RCA computes the complete relational extension of the K^e
contexts. The obtained extended contexts (K^{e+1}) possess relational attributes
which refer to concepts of the lattices obtained at the previous step.
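The following sketch illustrates one such extension step with existential scaling
(Python; the names, the relational-attribute format and the example extent are
illustrative and do not follow RCAExplore).

def existential_extension(source_objects, relation, relation_name, target_concepts):
    """Relational attributes added to the source context in one RCA step.

    relation: set of (source object, target object) pairs (objects-objects context).
    target_concepts: dict {concept name: extent} taken from the target lattice
    computed at the previous step.
    """
    new_incidence = set()
    for o in source_objects:
        linked = {t for (s, t) in relation if s == o}
        for name, extent in target_concepts.items():
            if linked & extent:            # existential scaling: some link lands in the extent
                new_incidence.add((o, f"exists {relation_name}:{name}"))
    return new_incidence

# Illustrative use with the isWritten relation of Figure 6 (the extent below is the
# set of languages with Standardized:No in Figure 3; the concept name is made up):
wikis = {"TeamPage", "Socialtext", "MindTouch", "DokuWiki", "EditMe"}
is_written = {("TeamPage", "Java"), ("EditMe", "Java"), ("Socialtext", "Perl"),
              ("DokuWiki", "Php"), ("MindTouch", "Php"), ("MindTouch", "C#")}
example_concepts = {"C_NotStandardized": frozenset({"Perl", "Php"})}
extra = existential_extension(wikis, is_written, "isWritten", example_concepts)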
    In our example, the RCF is composed of Wikisoftware, Programming Lan-
guage and of the objects-objects context isWritten. When we compute the com-
plete relational extension of this RCF, we extend the objects-attributes context
of Wikisoftware with relational attributes which refer to each concept of Pro-
gramming language. Figure 7 and Figure 8 represent lattices of Wikisoftware and
Programming language at step 0. Figure 9 presents the concept lattice from the

      extended objects-attributes context of Wikisoftwares, at step 1. In this example,
      we cannot go further than step 1.




                        Fig. 9. Concept lattice of Wikisoftwares, step 1


    In Figure 9, relational attributes are read as references to concepts in the
lattice of Programming languages at step 0 (Figure 7). An extended concept
lattice gives us the same kind of information as is emphasised in a basic
concept lattice, but it takes into account attributes from other product families.
This brings a new dimension to the search and classification of products from the
same family. In our example, it allows us to select a wikisoftware depending on
      the paradigm of its programming language. In Figure 9, we can read for instance:
       – DokuWiki (concept 1) is written in a programming language characterised
         by concepts 6, 7, 8, 3, 5 and 0 of Programming languages, corresponding
   to attributes Standardized:No, Procedural, Functional, Object Oriented and
   Imperative;
 – Team Page and EditMe are equivalent products because they are introduced
   in the same concept (same attributes and same relational attributes);
 – a wikisoftware not Open Source is written in a standardised language;
 – an Open Source wikisoftware is written in an unstandardised language.


4     Assessing scalability of the approach
So far, we have proposed a first method to obtain a concept lattice from a PCM’s
description and a second method to depict, with interconnected lattices, the case
where features possess their own PCM. In this section, we evaluate these two
methods on existing PCMs from Wikipedia to get an order of magnitude of the
size of the obtained structures.
    The number of generated concepts from a formal context depends on the
number of objects, the number of attributes and the form of the context: this
number can reach 2^min(|O|,|A|) for a lattice, and |O| + |A| for an AOC-poset.
    In the following tests, we generate both concept lattices and AOC-posets to
emphasise the impact of deleting concepts which introduce neither attributes nor
objects on the size of the structure. Each test was made twice, a first time with
full formal contexts (scaling technique giving two attributes for each boolean
feature, and keeping constraints in the form of attributes for constrained boolean
features) and a second time with reduced formal contexts (scaling technique giving
one attribute for each boolean feature).

4.1    Scalability for non-connected PCMs (FCA)
Firstly, we analysed 40 PCMs from Wikipedia without taking into account rela-
tions between products. These 40 PCMs were converted into formal contexts with
the method of Section 2. Results are presented in Figure 10. We have analysed
1438 products, generated 4639 binary attributes and 26002 formal concepts.
    Most of the concept lattices possess between 50 and 300 concepts, but some
of them can reach about 5000 concepts: this number is very high, and it would be
difficult to quickly process these data. Reduced contexts (blue plots) give smaller
structures, but some of them remain considerable. By contrast, the results for AOC-posets
are encouraging: the highest number of concepts obtained is 161. Most of them
possess between 30 and 60 concepts.

4.2    Scalability for connected PCMs (RCA)
Secondly, we made the same type of tests on structures which have been extended
with a relation, using the method of Section 3. We wanted to perform these tests on
a fairly large number of relationships. Yet, it is simple to find PCMs on
Wikipedia but it is more difficult to automatically find relationships between
these PCMs. To simulate relations between the PCMs we used in the first test,




                            Lattices (F) Lattices (R) AOC-posets (F) AOC-posets (R)
           Mean               427.375      131.35         50.3          41.025
           Median               174          62.5          50              34
           Minimum               21            8           16              8
           Maximum             4551         1336          161             148
           First Quartile        49         28.25        29.25            20.5
           Third Quartile      282.5       127.75        61.25           53.75

      Fig. 10. Box plots on the number of concepts obtained with concept lattices (top) and
      AOC-poset (bottom), from full formal contexts (green) and reduced formal contexts
      (blue)



      we chose to automatically generate random objects-objects contexts based on
      two existing relations we found on Wikipedia. We analysed these relations and
      found out that about 75% of objects are linked to at least one other object and
      that 95% of linked objects are linked to a single other object. We formed 20
      pairs of objects-attributes contexts and generated object-objects contexts for
      each pair according to our analysis.
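A sketch of such a generator is given below (Python; the authors’ exact procedure
is not detailed, so the treatment of objects with more than one link is our own
simplification).

import random

def random_relation(source_objects, target_objects,
                    p_linked=0.75, p_single=0.95, seed=None):
    """Generate a random objects-objects context following the observed profile:
    about 75% of source objects are linked, and 95% of linked objects have a single
    target; the remaining linked objects get two distinct targets (a simplification)."""
    rng = random.Random(seed)
    pairs = set()
    targets = list(target_objects)
    for o in source_objects:
        if rng.random() < p_linked:
            if rng.random() < p_single or len(targets) < 2:
                pairs.add((o, rng.choice(targets)))
            else:
                for t in rng.sample(targets, 2):   # simplification: exactly two links
                    pairs.add((o, t))
    return pairs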
          Results are presented in Figure 11. Concept lattices are huge (some can
      reach 14000 concepts) whereas AOC-posets remain relatively manageable (about
      200 concepts).
          These results can be related to the real data behind the products that we used,
      in a very simplified form, to illustrate the approach. In real data, the first existing
      relation is between LinuxDistribution (77 objects, 25 features) and FileSystem
      (103 objects, 60 features). With a lattice, we obtain 1174 concepts (full context)
      and 1054 concepts (reduced context). With an AOC-poset, we obtain 188 and 179
      concepts, respectively. The second relation is between Wikisoftware (43 objects, 12 features)
      and Programming Language (90 objects, 5 features). With a lattice, we obtain
      282 concepts (full context) and 273 concepts (reduced context). With an AOC-
      poset, we obtain 87 and 86 concepts, respectively. All the datasets (extracted from




                      Lattices (F) Lattices (R) AOC-posets (F) AOC-posets (R)
     Mean               2263.5        878.2         89.95          72.55
     Median               793          194            82            61.5
     Minimum               61           24            33             16
     Maximum             14083        7478           201            189
     First Quartile     359.25       108.25          67.5          47.25
     Third Quartile     2719.5       597.75        101.25          87.25

Fig. 11. Box plots on the number of concepts obtained with formal contexts extended
with RCA


wikipedia and generated) are available for reproducibility purpose at: http:
//www.lirmm.fr/recherche/equipes/marel/datasets/fca-and-pcm.
    According to these results, we can deduce that the obtained concept lattices contain a majority of empty concepts (concepts which introduce neither attributes nor objects): the AOC-poset thus appears as a good alternative to generate less complex structures, while keeping variability information in the same way as concept lattices. These results on product description datasets, which are either real datasets or datasets generated according to the profile of existing real ones, suggest that it is realistic to envision using FCA and RCA for variability representation within a canonical form integrating features and products.
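The contrast between the two structures can be made concrete on a toy example. The following Python sketch (a minimal illustration on a hypothetical four-product context, not the tool used for the experiments) computes all the concepts of the lattice by closing the object intents under intersection, then keeps only the object and attribute concepts, which are exactly the concepts of the AOC-poset.

    from itertools import combinations

    # Hypothetical toy context: product -> set of features.
    context = {
        "p1": {"a", "b"},
        "p2": {"a", "c"},
        "p3": {"a", "b", "c"},
        "p4": {"d"},
    }
    all_attrs = set().union(*context.values())

    def extent(attrs):
        # Objects owning every attribute of `attrs`.
        return {g for g, ms in context.items() if set(attrs) <= ms}

    def intent(objs):
        # Attributes shared by every object of `objs` (all attributes if `objs` is empty).
        return set.intersection(*(context[g] for g in objs)) if objs else set(all_attrs)

    # The intents of the lattice are the object intents, the full attribute set,
    # and every intersection of already known intents.
    intents = {frozenset(ms) for ms in context.values()} | {frozenset(all_attrs)}
    changed = True
    while changed:
        changed = False
        for x, y in combinations(list(intents), 2):
            z = x & y
            if z not in intents:
                intents.add(z)
                changed = True

    # The AOC-poset keeps only the concepts introducing an object or an attribute,
    # i.e. the object concepts (g'', g') and the attribute concepts (m', m'').
    object_intents = {frozenset(context[g]) for g in context}
    attribute_intents = {frozenset(intent(extent({m}))) for m in all_attrs}

    print("lattice concepts:", len(intents))                               # 7 on this toy context
    print("AOC-poset concepts:", len(object_intents | attribute_intents))  # 5 on this toy context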


5   Related work

To our knowledge, the first research work using Formal Concept Analysis to analyse the relationships between product configurations and variable features in order to assist the construction of a product line is that of Loesch and Ploedereder [13]. In this approach, the concept lattice is analysed to extract information about groups of features that are always present, never present, always present together, or never present together. The authors use this information to describe constraints
between features and to propose restructurings of these features (merging, removal, or identification of alternative feature groups). In [14], the authors study, in an aspect-oriented requirements engineering context, a concept lattice which classifies scenarios by functional requirements. They analyse, on the one hand, the relation between concepts and quality requirements (usability, maintenance) and, on the other hand, the interferences between quality requirements. They also analyse the impact of modifications. This analysis has been extended with observations about the product information contained in the lattice (or the AOC-poset), and about the way some works implicitly use FCA without mentioning it [1]. In the present work, we show how to extend the original approach, which analyses products described by features, to a more general case where some features are themselves products. Moreover, we evaluate how FCA and RCA scale on product families described by PCMs from Wikipedia and linked by randomly generated relationships.
    In the domain of product lines, another category of works focuses on the identification of features in source code using FCA [19, 3]. In this case, the described entities are variants of software systems, described by their source code, and the authors try to produce groups of source code elements that are candidate features. Some works search for traceability links between features and the code [16]. In [7], the authors cross-reference source code parts with the scenarios that execute them and the features they use; the goal is to identify the parts of the code which correspond to feature implementations. In our case, we work neither on source code nor with scenarios, but with existing product descriptions.
    Finally, a last category of works studies feature organisation in FMs with FCA. Some approaches [20, 15] use conceptual structures (concept lattices or AOC-posets) to recognise constraints, but also to suggest typed sub-feature relations linked to the domain. In [15], the authors study in detail the extraction of implication rules from the lattice and of cover relationships, to determine for instance whether a set of features covers all the products. Recent works [2, 18] focus on highlighting logical relationships between features in an FM and on the identification of transverse constraints. These logical relationships are used more specifically in [18] to analyse the variability of a set of architectural configurations. In the present work, we generate a structure, or several interconnected structures, created to analyse variability, but we do not consider issues about FM construction.


      6   Conclusion

In this paper, we analyse the feasibility of using Formal Concept Analysis and concept lattices as a complement to Product Comparison Matrices to represent variability in product lines. We propose a method to convert a description from a product comparison matrix into a concept lattice using the scaling technique. We also propose a way to model, with Relational Concept Analysis, the special case where a product family is described by another product family. We obtain
interconnected lattices that bring a new dimension to the search and classification of products when they are related to other product families.
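As an illustration of the conversion step, the following Python sketch shows one plausible scaling of a small hypothetical PCM into binary formal contexts. The product names, characteristics, and the handling of boolean values (a "full" encoding keeping both a positive and a negative attribute versus a "reduced" encoding keeping only the positive one) are assumptions made for the example, not the exact operators used in the experiments reported above.

    # Hypothetical PCM: product -> {characteristic: value}; names and values are invented.
    pcm = {
        "distro_A": {"licence": "GPL", "pkg_manager": "apt", "rolling_release": True},
        "distro_B": {"licence": "GPL", "pkg_manager": "rpm", "rolling_release": False},
        "distro_C": {"licence": "BSD", "pkg_manager": "pkg", "rolling_release": True},
    }

    def scale(pcm, full_boolean=True):
        # Nominal scaling: each multi-valued cell becomes one binary attribute
        # "characteristic=value"; a boolean cell becomes either one attribute
        # (reduced encoding: only the true value is kept) or one of two attributes
        # (full encoding: false values are kept as an explicit negative attribute).
        context = {}
        for product, cells in pcm.items():
            attrs = set()
            for char, val in cells.items():
                if isinstance(val, bool):
                    if val:
                        attrs.add(char)
                    elif full_boolean:
                        attrs.add("not_" + char)
                else:
                    attrs.add("%s=%s" % (char, val))
            context[product] = attrs
        return context

    full_context = scale(pcm, full_boolean=True)     # e.g. {"licence=GPL", "pkg_manager=apt", "rolling_release"}
    reduced_context = scale(pcm, full_boolean=False)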
    Subsequently, we perform a series of tests to determine the order of magnitude of the number of concepts composing the structures obtained, first with FCA, by converting PCMs into formal contexts, and then with RCA, by introducing relations between these contexts. In these tests, we compare two structures: concept lattices, which establish a sub-order among all the concepts, and AOC-posets, which establish a sub-order among the concepts introducing at least one attribute or one object. It appears that most of the concepts do not introduce any information, so the AOC-poset appears as a more advantageous alternative, in particular for presenting information to an end-user. Concept lattices remain useful as well, when they are medium-sized and for automatic data manipulation. We also show the effect of two different encodings of boolean values.
    In the future, we will use the work of [4] to automate the conversion of PCMs as much as possible. Moreover, research on the classification process (using different scaling operators) and on its applications to variability will be performed to complete this analysis. A detailed study of the possibilities offered by RCA to model other cases is also considered. Finally, it could be interesting to study the transitions between different structures such as FMs, PCMs, AOC-posets and concept lattices, so as to be able to select one depending on the kind of operation to apply.


References
 1. Al-Msie'deen, R., Seriai, A.D., Huchard, M., Urtado, C., Vauttier, S., Al-Khlifat,
    A.: Concept lattices: A representation space to structure software variability. In:
    ICICS 2014: The fifth International Conference on Information and Communica-
    tion Systems. pp. 1 – 6. Irbid, Jordan (Apr 2014), http://hal-lirmm.ccsd.cnrs.
    fr/lirmm-01075533
 2. Al-Msie’deen, R., Huchard, M., Seriai, A., Urtado, C., Vauttier, S.: Reverse engi-
    neering feature models from software configurations using formal concept analysis.
    In: Proceedings of the Eleventh International Conference on Concept Lattices and
    Their Applications, Košice, Slovakia, October 7-10, 2014. pp. 95–106 (2014)
 3. Al-Msie’deen, R., Seriai, A., Huchard, M., Urtado, C., Vauttier, S., Salman, H.E.:
    Mining features from the object-oriented source code of a collection of software
    variants using formal concept analysis and latent semantic indexing. In: Proc. of
    The 25th SEKE. pp. 244–249 (2013)
 4. Bécan, G., Sannier, N., Acher, M., Barais, O., Blouin, A., Baudry, B.: Automating
    the formalization of product comparison matrices. In: Proc. of the 29th ACM/IEEE
    ASE ’14. pp. 433–444 (2014)
 5. Clements, P.C., Northrop, L.M.: Software product lines: practices and patterns.
    Addison-Wesley (2001)
 6. Czarnecki, K., Eisenecker, U.W.: Generative programming: methods, tools, and
    applications. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA
    (2000)
 7. Eisenbarth, T., Koschke, R., Simon, D.: Locating features in source code. IEEE
    Trans. Softw. Eng. 29(3), 210–224 (2003)
       8. Ganter, B., Wille, R.: Formal concept analysis - mathematical foundations.
          Springer (1999)
 9. Griss, M.L., Favaro, J., d'Alessandro, M.: Integrating Feature Modeling with the
          RSEB. In: Proceedings of the 5th International Conference on Software Reuse.
          pp. 76–. ICSR ’98, IEEE Computer Society, Washington, DC, USA (1998), http:
          //dl.acm.org/citation.cfm?id=551789.853486
      10. Hacène, M.R., Huchard, M., Napoli, A., Valtchev, P.: Relational concept analysis:
          mining concept lattices from multi-relational data. Ann. Math. Artif. Intell. 67(1),
          81–108 (2013)
      11. Kang, K.C., Cohen, S.G., Hess, J.A., Novak, W.E., Peterson, A.S.: Feature-oriented
    domain analysis (FODA) feasibility study (November 1990)
      12. Kang, K.C., Kim, S., Lee, J., Kim, K., Shin, E., Huh, M.: Form: A feature-oriented
          reuse method with domain-specific reference architectures. Ann. Software Eng. 5,
          143–168 (1998)
      13. Loesch, F., Ploedereder, E.: Restructuring variability in software product lines us-
          ing concept analysis of product configurations. In: Proc. of the 11th IEEE ECSMR.
          pp. 159–170 (2007)
      14. Niu, N., Easterbrook, S.M.: Concept analysis for product line requirements. In:
          Proc. of the 8th AOSD 2009. pp. 137–148 (2009)
      15. Ryssel, U., Ploennigs, J., Kabitzsch, K.: Extraction of feature models from formal
          contexts. In: Proc. of ACM SPLC ’11. pp. 4:1–4:8 (2011)
      16. Salman, H.E., Seriai, A., Dony, C.: Feature-to-code traceability in a collection of
          software variants: Combining formal concept analysis and information retrieval.
    In: Proc. of the 14th IEEE IRI. pp. 209–216 (2013)
      17. Sannier, N., Acher, M., Baudry, B.: From comparison matrix to variability model:
          The wikipedia case study. In: Proc. of the 28th IEEE/ACM ASE 2013. pp. 580–585
          (2013)
      18. Shatnawi, A., Seriai, A.D., Sahraoui, H.: Recovering architectural variability of a
          family of product variants. In: To appear in Proc. of the 14th ICSR (2015)
      19. Xue, Y., Xing, Z., Jarzabek, S.: Feature location in a collection of product variants.
          In: Proc. of the 19th IEEE WCRE. pp. 145–154 (2012)
      20. Yang, Y., Peng, X., Zhao, W.: Domain feature model recovery from multiple ap-
          plications using data access semantics and formal concept analysis. In: Proc. of the
          16th IEEE WCRE. pp. 215–224 (2009)