<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Polynomial Subgroup Discovery by means of FCA?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aleksey V. Buzmakov</string-name>
          <email>avbuzmakov@hse.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The goal of subgroup discovery is to find groups of objects that differ significantly from the “average” object w.r.t. some supervised information. It is a computationally intensive procedure that traverses a large search space corresponding to the set of formal concepts. It was recently found that a part of the formal concepts, called stable concepts, can be found in polynomial time. Accordingly, in this paper a new algorithm, called SD-SOFIA, is presented. SD-SOFIA fits the subgroup discovery process into the framework of stable concept search. The proposed algorithm is evaluated on a dataset from the UCI repository. It is shown that its practical computational complexity is polynomial.</p>
      </abstract>
      <kwd-group>
        <kwd>Subgroup Discovery</kwd>
        <kwd>Stable Concepts</kwd>
        <kwd>Supervised Learning</kwd>
        <kwd>Exploratory Data Analysis</kwd>
        <kwd>Algorithms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Subgroup discovery is a supervised data mining technique for finding groups of objects that exhibit unexpected behavior w.r.t. some supervised information [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Class labels of objects are an example of such supervised information. Subgroup discovery not only enumerates the objects with unexpected behavior but also describes them in a human-readable form, providing a way to understand the connection between the supervised information and the description space. Such understanding is important for the analysis of the real process generating the supervised information.
      </p>
      <p>
        Formal concept analysis (FCA) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is a well-established mathematical formalism suitable for describing the subgroup discovery process. Indeed, extents of formal concepts are subgroups, and their intents are subgroup descriptions. Every concept can then be evaluated by means of a quality function that relates object class labels to the interest of the concept. In these terms, the task of subgroup discovery is to find the concept with the best value of the quality function.
      </p>
      <p>
        One of the challenges in subgroup discovery is the exponentially large search space of concepts, which entails two consequences. First, it is computationally hard to find the best subgroups. Second, since the search space is large, it is likely that some of its elements are associated with high values of the quality function just by chance, i.e., spurious findings are possible. There are a number of approaches that deal with these consequences. In particular, the search space can be limited, e.g., it can be stated that only concepts with few attributes constitute the search space. Another possible way is to rely on a heuristic that is unlikely to miss the best subgroups [
        <xref ref-type="bibr" rid="ref14 ref5">14, 5</xref>
        ]. Finally, some approaches allow gradually refining the search space in such a way that the computation can be stopped at any moment and the best result found so far is reported [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The reported study was funded by RFBR, project number 19-31-37001. For simplicity, only the classification task is considered in this paper.</p>
      <p>
        The problem of spurious findings can be solved by controlling the statistical significance of the examined subgroups [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Such approaches can be naturally combined with the approaches that limit the search space. By contrast, combining statistically significant subgroup discovery with heuristic methods is hard.
      </p>
      <p>
        Accordingly, in this paper we discuss a limitation of the search space that is based on the stability of a formal concept. This choice is motivated by two facts. First, stability is a meaningful concept selection method. Indeed, stability is the probability that a concept is preserved after a random deletion of objects from the dataset. Thus, if just a small modification of the object set is enough to remove a concept, then this concept can probably be removed from the search space. Second, it was recently shown that the stability threshold can be adjusted on the fly, allowing for a polynomial-time computational procedure [
        <xref ref-type="bibr" rid="ref6 ref8">8, 6</xref>
        ]. Consequently, combining this procedure with subgroup discovery gives an opportunity to control the subgroup discovery computation time and allows for statistically significant subgroup discovery.
      </p>
      <p>Accordingly, the main contribution of this paper is an algorithm combining the subgroup discovery process and the stability threshold adjustment, allowing for subgroup discovery with practical polynomial time complexity and mathematical guarantees for the resulting subgroup.</p>
      <p>
        The rest of the paper is organized as follows. Section 1 provides some basic definitions, then in Section 2 the task is defined. The algorithms proposed in this paper are discussed in Section 3. Finally, before concluding the paper, the algorithms are evaluated on a dataset from the UCI repository [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Definitions</title>
      <p>The subgroup discovery task statement depends (1) on the search space and (2) on the quality function. The search space is defined by means of FCA, while the quality function is considered to be known and is not discussed in this paper.</p>
      <sec id="sec-2-1">
        <title>Formal concept analysis</title>
        <p>For simplicity the dataset is described as a formal context (G, M, I), where G is the set of objects, M is the set of attributes, and I ⊆ G × M is a relation between them. The sets of objects and attributes are connected by means of the derivation operators</p>
        <p>A↑ = {m ∈ M | ∀g ∈ A ((g, m) ∈ I)}, for A ⊆ G, (1)
B↓ = {g ∈ G | ∀m ∈ B ((g, m) ∈ I)}, for B ⊆ M. (2)</p>
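        <p>As a quick illustration, the two derivation operators can be sketched in Python on a hypothetical toy context (all names and data below are invented for the example):</p>

```python
def up(A, I, M):
    """A-up: the attributes shared by all objects in A."""
    return {m for m in M if all((g, m) in I for g in A)}

def down(B, I, G):
    """B-down: the objects that have all attributes in B."""
    return {g for g in G if all((g, m) in I for m in B)}

# Toy context: three objects, three attributes (hypothetical data).
G = {1, 2, 3}
M = {"a", "b", "c"}
I = {(1, "a"), (1, "b"), (2, "a"), (2, "c"), (3, "a")}

B = up({1, 2}, I, M)   # common attributes of objects 1 and 2
A = down(B, I, G)      # back to the objects having those attributes
print(sorted(B), sorted(A))  # ['a'] [1, 2, 3]
```

        <p>The resulting pair ({1, 2, 3}, {a}) is closed in both directions, i.e., it is a formal concept.</p>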
        <p>A formal concept is a pair (A, B), where A ⊆ G is called the extent and B ⊆ M is called the intent, such that A = B↓ and B = A↑. The set of concepts is ordered w.r.t. the order on extents (or, dually, the inverse order on intents). This order forms a lattice called the concept lattice.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Projections</title>
        <p>
          The algorithm presented in Section 3 is based on the notion of projections.
Originally, it was introduced within the framework of pattern structures [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
However, for the sake of simplicity, it is presented here in terms of standard
FCA.
        </p>
        <p>Definition 1. A projection ψ : 2^M → 2^M is a function satisfying:
– idempotency, i.e., ψ(ψ(X)) = ψ(X);
– monotonicity, i.e., X ⊆ Y −→ ψ(X) ⊆ ψ(Y);
– contractivity, i.e., ψ(X) ⊆ X.</p>
        <p>In this paper, a special kind of projection is considered, corresponding to the removal of certain attributes. In particular, ψY(X) = X ∩ Y, i.e., it removes all attributes outside of Y. It can be seen that ψY is a projection. The larger the set Y, the more attributes are preserved after the projection.</p>
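        <p>The three defining properties of such an attribute-removing projection can be checked mechanically; the following Python sketch (with invented example sets) verifies idempotency, monotonicity, and contractivity for ψY(X) = X ∩ Y:</p>

```python
def make_projection(Y):
    """psi_Y(X) = X ∩ Y: removes every attribute outside of Y."""
    def psi(X):
        return set(X).intersection(Y)
    return psi

psi = make_projection({"a", "b"})
X, Xbig = {"a", "c"}, {"a", "b", "c", "d"}   # X ⊆ Xbig

assert psi(psi(X)) == psi(X)          # idempotency
assert psi(X).issubset(psi(Xbig))     # monotonicity
assert psi(X).issubset(X)             # contractivity
print(sorted(psi(Xbig)))              # ['a', 'b']
```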
        <p>
          As will be discussed later, the algorithm starts from the simplest projections, which remove many attributes, and then iteratively adds attributes one by one, updating the result. A similar approach was used in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], where numerical intervals were iteratively updated.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Subgroup Discovery</title>
        <p>From the subgroup discovery (SD) point of view, the concept lattice is the search space. The extent of a formal concept is the subgroup and the intent of a formal concept is the description of this subgroup. Given a quality function Q : 2^G → R, the goal of SD is to find the formal concept whose extent maximizes the quality function Q.</p>
        <p>
          Let us consider a standard quality function for the classification task [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Let the class labels of objects be given by a function class : G → {0, 1}, where {0, 1} is the set of available class labels. Given a set of objects A, the weighted relative accuracy can be expressed as
        </p>
        <p>Q1(A) = (|A| / |G|) · ( |{g ∈ A | class(g) = 1}| / |A| − |{g ∈ G | class(g) = 1}| / |G| ).</p>
        <p>This quality function expresses a trade-off between the relative size of the subgroup |A| / |G| and the improvement in the “purity” of class labels within the concept w.r.t. the average “purity”. (The operators (·)↓ and (·)↑ are used instead of the standard (·)′, since they make the reading clearer w.r.t. the operator argument.)</p>
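        <p>As an illustration, the weighted relative accuracy can be computed directly from its definition; the dataset below is invented for the example:</p>

```python
def wracc(A, G, labels):
    """Weighted relative accuracy Q1 of a subgroup A within dataset G."""
    if not A:
        return 0.0
    purity_A = sum(labels[g] for g in A) / len(A)   # purity inside the subgroup
    purity_G = sum(labels[g] for g in G) / len(G)   # average purity
    return (len(A) / len(G)) * (purity_A - purity_G)

# Hypothetical toy dataset: six objects, half of them labeled 1.
G = [0, 1, 2, 3, 4, 5]
labels = {0: 1, 1: 1, 2: 1, 3: 0, 4: 0, 5: 0}
print(wracc([0, 1, 2], G, labels))  # (3/6) * (1.0 - 0.5) = 0.25
```

        <p>A subgroup is interesting when it is both large and purer than the dataset average; a pure but tiny subgroup scores low because of the |A| / |G| factor.</p>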
        <p>
          Although it is possible to iterate over all formal concepts and compute their quality, this is not efficient. Thus, a typical exhaustive subgroup discovery procedure is implemented as a branch-and-bound search. It is based on the following two ingredients [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]:
– A refinement operator r : 2^M → 2^(2^M) that is monotone, i.e., given B0 ⊆ M, (∀B ∈ r(B0)) B ⊇ B0. The refinement operator organizes the procedure of generating new “more specific” candidate subgroups based on already found ones.
– An optimistic estimator Q̄ : 2^G → R of the subgroup quality function Q.
        </p>
        <p>The optimistic estimator Q̄(A) is an upper bound of the quality function Q on any subset of A, i.e., (∀S ⊆ A) Q̄(A) ≥ Q(S).</p>
        <p>Given these two components and the quality qbest of the best subgroup found so far, the discovery procedure is as follows. For any found subgroup A with description B0 = A↑, one can verify whether any subset of A is a potentially interesting subgroup, i.e., whether Q̄(A) &gt; qbest. If not, the subgroup A can be ignored; otherwise it is refined by means of the refinement operator r. The refined subgroups with descriptions B ∈ r(B0) are evaluated by means of the subgroup quality function Q, and if any subgroup is better than the best one found so far, the best subgroup is updated.</p>
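        <p>The branch-and-bound loop described above can be sketched generically. Everything below (the refinement over ordered attribute lists, the toy quality function, and its deliberately loose optimistic bound) is invented for illustration:</p>

```python
def branch_and_bound(root, refine, Q, Q_bound):
    """Exhaustive search that prunes branches via an optimistic estimate."""
    best, q_best = root, Q(root)
    stack = [root]
    while stack:
        B = stack.pop()
        for Bref in refine(B):
            if Q_bound(Bref) > q_best:      # the branch may still beat the best
                q = Q(Bref)
                if q > q_best:
                    best, q_best = Bref, q
                stack.append(Bref)
    return best, q_best

# Toy search space: ordered subsets of three attributes (hypothetical).
items = ["a", "b", "c"]

def refine(B):
    start = items.index(B[-1]) + 1 if B else 0
    return [B + [m] for m in items[start:]]

def Q(B):  # toy quality: reward attributes a and c, penalize length
    return len(set(B).intersection({"a", "c"})) - 0.1 * len(B)

def Q_bound(B):  # deliberately loose optimistic bound (no real pruning)
    return Q(B) + 2

print(branch_and_bound([], refine, Q, Q_bound))  # (['a', 'c'], 1.8)
```

        <p>With a tighter, still admissible bound, whole branches are skipped while the returned optimum stays the same; that is the source of the speed-up.</p>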
      </sec>
      <sec id="sec-2-4">
        <title>Formal Concept Stability and Δ-stability</title>
        <p>The computational efficiency of branch-and-bound is still limited if the dataset is large. A further improvement in efficiency can be made by modifying the search space. One way is to consider only “stable” concepts as the search space.</p>
        <p>
          Stability of a formal concept [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] measures how strongly the concept depends on the dataset. If the removal of random objects from the dataset is likely to remove a concept, then the concept is not stable.
        </p>
        <p>
          It was shown that, given a context and a concept, the computation of concept stability is #P-complete [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Accordingly, in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] bounds on the stability of a concept were discussed; in particular, stability is bounded as follows:
        </p>
        <p>1 − Σ_{d ∈ DD(c)} 2^{−Δ(c,d)} ≤ Stab(c) ≤ 1 − max_{d ∈ DD(c)} 2^{−Δ(c,d)}, (3)</p>
        <p>where DD(c) is the set of all direct descendants of a concept c in the lattice and Δ(c, d) is the size of the set-difference between the extent of c and the extent of d, i.e., Δ(c, d) = |Ext(c) \ Ext(d)|. It can be seen that stability in (3) is bounded tightly if Δ(c) = min_{d ∈ DD(c)} Δ(c, d) is high. Moreover, the higher the stability, the tighter the bounds. Thus, in order to identify the most stable concepts one can switch to Δ(c), called the Δ-measure, which is polynomially computable.</p>
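        <p>For instance, the bounds in (3) can be evaluated numerically from the Δ-values to the direct descendants; the extents below are invented for the example:</p>

```python
def delta(ext_c, ext_d):
    """Δ(c, d) = |Ext(c) \\ Ext(d)|, the size of the extent difference."""
    return len(set(ext_c) - set(ext_d))

def stability_bounds(ext_c, descendant_exts):
    """Lower and upper bounds of Eq. (3) on Stab(c)."""
    deltas = [delta(ext_c, d) for d in descendant_exts]
    lower = 1 - sum(2.0 ** (-dd) for dd in deltas)
    upper = 1 - 2.0 ** (-min(deltas))   # max of 2^-Δ is reached at the min Δ
    return lower, upper

# Hypothetical concept extent and the extents of its two direct descendants.
c = {1, 2, 3, 4, 5, 6}
dd = [{1, 2, 3}, {1, 2, 3, 4, 5}]     # Δ-values: 3 and 1
print(stability_bounds(c, dd))        # (0.375, 0.5)
```

        <p>Here Δ(c) = 1, so the bounds are still loose; a concept with a larger minimal Δ would have its stability pinned down much more tightly.</p>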
        <p>
          Recently, it was also shown that, given a Δ-measure threshold θ, it is possible to directly find the formal concepts with a Δ-measure higher than θ [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Moreover, by introducing a special adjustment procedure, one is able to extract the concepts with the highest value of the Δ-measure in polynomial time [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Accordingly, in this paper this approach is translated to the subgroup discovery task.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Task Statement</title>
      <p>The goal of this paper is to present an algorithm for the subgroup discovery task that modifies the search space in such a way that its practical computational complexity is polynomial.</p>
      <p>More formally, given a formal context K = (G, M, I) and a quality function Q of a formal concept, the goal is an algorithm that finds a threshold θ of the Δ-measure and the corresponding best Δ-stable concept C w.r.t. Q, such that the practical computational complexity is polynomial w.r.t. K, T(Q), and the memory usage limit L, where T(Q) is the computational complexity of Q, and L is the maximal number of potential concepts to be extended at the same time.</p>
    </sec>
    <sec id="sec-4">
      <title>Algorithms</title>
      <sec id="sec-4-1">
        <title>Algorithm θ-Σοφια</title>
        <p>
          Let us first recall the original algorithm Σοφια [
          <xref ref-type="bibr" rid="ref6 ref8">8, 6</xref>
          ]. It is based on the following observation.
        </p>
        <p>Proposition 1 (Projection antimonotonicity). Given a context K = (G, M, I), for any projection ψ : 2^M → 2^M and a concept C = (A, B) it holds that
Δ(G,M,I)(C) ≤ Δ(G,ψ(M),I)((ψ(B)↓, ψ(B))). (4)</p>
        <p>
          It should be noticed that the Δ-measure depends on the structure of the concept lattice, i.e., the Δ-measure should be computed when the lattice is fixed. Accordingly, in (4) every Δ-measure is indexed with the corresponding formal context. It should also be noticed that (ψ(B)↓, ψ(B)) is indeed a formal concept in (G, ψ(M), I), as was shown earlier [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>Property (4) gives a way to search directly for Δ-stable concepts. Indeed, if a concept C = (A, B) is not Δ-stable in (G, ψ(M), I) for some ψ, then no preimage of B can be the intent of a Δ-stable concept in (G, M, I), i.e., any concept with an intent B̂ such that ψ(B̂) = B is not Δ-stable. Thus, if one is able to find the Δ-stable concepts in (G, ψ(M), I), then only the preimages of these concepts w.r.t. ψ can be Δ-stable in (G, M, I).</p>
        <p>However, how can one efficiently find the Δ-stable concepts in (G, ψ(M), I)? Since (G, ψ(M), I) is itself a context, it is possible to consider a projection of it. This forms a chain of projections: the Δ-stable concepts are first found in the most general projection, which removes all attributes; the next projection removes all but one attribute, the next one all but two attributes, and so on.</p>
        <p>Data: a context K = (G, M, I), a chain of projections Ψ = {ψ0, ψ1, · · · , ψk}, and a threshold θ for the Δ-measure.
1 Function ExtendProjection(i, θ, Pi−1)
   Data: projection number i, a threshold value θ, and the set of concepts Pi−1 for the projection ψi−1.
   Result: the set Pi of all concepts with a value of the Δ-measure higher than the threshold θ for the projection ψi.
2    Pi ← ∅; /* collect the surviving preimages */
3    foreach c ∈ Pi−1 do
4      foreach preimage cc of c in ψi do
5        if Δψi(cc) &gt; θ then Pi ← Pi ∪ {cc};
6    return Pi;
7 P0 ← {(G, ∅)}; /* the only concept of the empty projection */
8 foreach i ∈ {1, . . . , k} do Pi ← ExtendProjection(i, θ, Pi−1);
9 return Pk;
Algorithm 1: The θ-Σοφια algorithm for finding concepts in K with a value of the Δ-measure higher than a threshold θ.</p>
        <p>This procedure is shown in Algorithm 1 and is called the θ-Σοφια algorithm. The computational complexity of this algorithm is</p>
        <p>O(|M| · max_{0 &lt; i ≤ |M|} |Pi| · (T(Preimages) + T(Δ))).</p>
        <p>It becomes polynomial if the threshold θ is adjusted in such a way that |Pi| &lt; L for any i, for a predefined memory limit L.</p>
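        <p>This projection chain can be prototyped in a few dozen lines of Python. The sketch below is a toy re-implementation under simplifying assumptions (invented helper names, brute-force closures), not the authors' optimized code:</p>

```python
def theta_sofia(G, M, I, theta):
    """Toy sketch of θ-Σοφια: walk the chain of projections ψ_i that keep
    the first i attributes, pruning concepts whose Δ-measure falls below θ."""
    def down(B):
        return frozenset(g for g in G if all((g, m) in I for m in B))

    def up(A, Mi):
        return frozenset(m for m in Mi if all((g, m) in I for g in A))

    def delta(A, B, Mi):
        # Δ within projection Mi: minimal extent loss over candidate descendants
        rest = [m for m in Mi if m not in B]
        return min((len(A) - len(down(B.union({m}))) for m in rest),
                   default=len(A))

    concepts = {(down(frozenset()), frozenset())}    # projection 0: (G, ∅)
    Mi = []
    for m in M:                                      # add one attribute per step
        Mi = Mi + [m]
        nxt = set()
        for (A, B) in concepts:
            for Bc in (B, B.union({m})):             # the two candidate preimages
                Ac = down(Bc)
                cc = (Ac, up(Ac, Mi))                # re-close in the new context
                if delta(cc[0], cc[1], Mi) >= theta:
                    nxt.add(cc)
        concepts = nxt
    return concepts

# Hypothetical toy context.
G = {1, 2, 3}
M = ["a", "b", "c"]
I = {(1, "a"), (1, "b"), (2, "a"), (2, "c"),
     (3, "a"), (3, "b"), (3, "c")}
print(len(theta_sofia(G, M, I, theta=1)))  # 4
```

        <p>On this toy context all four formal concepts survive θ = 1, while θ = 2 prunes everything already in the early projections.</p>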
      </sec>
      <sec id="sec-4-2">
        <title>Algorithm SD-SOFIA</title>
        <p>
          The usage of Algorithm 1 is limited for subgroup discovery, since it finds the preimages of all concepts from projection i to projection i + 1 simultaneously. By contrast, the most commonly used strategy in subgroup discovery is the expansion of the most promising concept [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This allows concepts with high quality Q to be found earlier, improving the efficiency of branch cutting.
        </p>
        <p>It should be noticed that in Algorithm 1 finding the preimages of a concept does not depend on the other concepts; thus, the stored concepts can correspond to different projections. In this case one is able to choose the order of concept expansion and, in particular, to expand first the most promising concepts w.r.t. the subgroup discovery task. This procedure is shown in Algorithm 2.</p>
        <p>Algorithm 2: The SD-SOFIA algorithm, identifying the Δ-measure threshold θ and the corresponding best concept w.r.t. a quality function Q while keeping the practical computational complexity polynomial.</p>
        <p>All concepts are stored in a queue. The queue can contain concepts from different projections; thus, a concept is denoted as (A, B | i), where A and B are the extent and the intent of the concept, respectively, and i is the projection it is computed in. The corresponding elements of a concept c = (A, B | i) can be extracted by means of the functions Ext, Int, and Proj for the extent, the intent, and the projection number of c, respectively.</p>
        <p>Let us first fix some order on the attributes M. Let Mi be the first i attributes from M. The algorithm then relies on the following chain of projections Ψ = ψ0, ψ1, . . . , ψ|M|, where ψi(X) = X ∩ Mi, i.e., ψi removes all attributes but the first i attributes from M. In line 2, Algorithm SD-SOFIA initializes the queue with the only available concept in projection 0. This concept is (G, ∅). Indeed, since M0 contains no attributes, the only available intent is ∅.</p>
        <p>Then, while the queue is not empty, the concepts are extracted one by one. For the subgroup discovery task the concepts are extracted (line 4) w.r.t. their potential, i.e., the value of the optimistic estimate Q̄ of the quality function Q. Then, in line 5, the preimages of the most promising concept c = (A, B | i) are computed. Since projection ψi+1 preserves one more attribute, say m, than projection ψi, there are only two possible preimages: c1 = (A, B↓↑ | i + 1) and c2 = ((B ∪ {m})↓, (B ∪ {m})↓↑ | i + 1). The preimages c1 and c2 can coincide, in which case only one of them should be considered.</p>
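        <p>The “most promising first” extraction order can be realized with a priority queue keyed by the optimistic estimate. A minimal sketch (a hypothetical helper built on Python's heapq, not part of the original algorithm's pseudocode):</p>

```python
import heapq

class PromisingQueue:
    """Pop the concept with the highest optimistic estimate first."""
    def __init__(self):
        self._heap = []
        self._n = 0          # insertion counter breaks ties between equal bounds

    def push(self, bound, concept):
        self._n += 1
        heapq.heappush(self._heap, (-bound, self._n, concept))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

q = PromisingQueue()
q.push(0.3, ("A1", "B1", 2))   # (extent, intent, projection) placeholders
q.push(0.9, ("A2", "B2", 1))
q.push(0.5, ("A3", "B3", 3))
print(q.pop())  # ('A2', 'B2', 1)
```

        <p>Negating the bound turns Python's min-heap into a max-heap, so the concept with the highest estimate is always expanded next.</p>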
        <p>Then every preimage cc of the concept c is processed in lines 7–14. First, in lines 7–9 it is verified whether cc is already in the last projection ψ|M|. Only in this case is the final Δ-measure value for this concept known, and thus only at this moment is it possible to decide whether this concept can be reported as the best concept. If yes, in line 8 the quality of cc is checked, and if it is high the concept is saved as a potentially best concept.</p>
        <p>1 Function AdjustThld(queue, θ0)
2   queue ← {c ∈ queue | best.IsPromising(c)};
3   if queue.Size &lt; L then
4     return θ0;
5   return min {θ &gt; θ0 | L ≥ |{c ∈ queue | Δ(c) &gt; θ}|};
Algorithm 3: Adjustment of the minimal threshold for the Δ-measure.</p>
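        <p>A direct transcription of the threshold adjustment might look as follows. This is a toy stand-in: the Δ-values come from a plain dict, and the promisingness filter of Algorithm 3 (line 2) is omitted:</p>

```python
def adjust_threshold(queue, theta0, L, delta):
    """Sketch of AdjustThld: raise θ just enough so that at most L
    concepts with Δ(c) > θ remain; `delta` maps concept id to its Δ."""
    if L > len(queue):
        return theta0                      # the queue still fits in memory
    for theta in sorted({delta[c] for c in queue}):
        if theta > theta0:
            survivors = sum(1 for c in queue if delta[c] > theta)
            if L >= survivors:
                return theta               # minimal sufficient threshold
    return max(delta[c] for c in queue)    # fallback: prune everything

# Hypothetical queue of four concepts with their Δ-values.
queue = ["c1", "c2", "c3", "c4"]
delta = {"c1": 1, "c2": 2, "c3": 2, "c4": 5}
print(adjust_threshold(queue, 0, L=2, delta=delta))  # 2
```

        <p>Only existing Δ-values need to be tried as candidate thresholds, which keeps the adjustment polynomial, as noted in the text.</p>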
        <p>In lines 10–11 the concept cc is checked: if the value of its Δ-measure is smaller than the threshold θ, it is not a Δ-stable concept and, thus, its preimages cannot be Δ-stable either.</p>
        <p>Then, in lines 12–13 it is verified whether the preimages of the concept cc can potentially be reported as the best concepts, i.e., whether the optimistic estimate of the quality function on its subconcepts (concepts with smaller but comparable extents) is large enough. If yes, the concept cc is pushed to the queue in line 14 for further processing.</p>
        <p>Finally, in lines 15–16 the threshold θ is adjusted in such a way that the size of the queue remains smaller than the memory limit L (a parameter of the algorithm). Let us discuss the adjustment in more detail.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Adjustment of Δ threshold</title>
        <p>The algorithm for adjusting the threshold is shown in Algorithm 3. First, in line 2 the algorithm removes all patterns that cannot generate concepts of high quality. Then, if necessary, it increases the threshold θ in such a way that the number of concepts with a high Δ-measure is less than L.</p>
        <p>This adjustment is polynomial; however, the computational complexity of the whole procedure is no longer polynomial. Indeed, if the concept with the maximal projection number is always expanded, then the procedure becomes a depth-first FCA algorithm, which requires only O(|M|) memory to store concepts; thus, if L &gt; |M|, the adjustment procedure is never run and Algorithm 2 can iterate over all concepts. However, if one always selects the concept with the minimal projection number, then the procedure becomes the same as in Algorithm 1, with polynomial complexity. Indeed, in this case one can consider the expansions of all concepts with the minimal projection number as one expansion, and it becomes exactly the Σοφια algorithm. Similarly, allowing only the expansion of concepts with a small projection number, i.e., with a projection number in the interval [imin, imin + k], where imin is the minimal projection number, gives an intermediate worst-case computational complexity with a multiplier of 2^k. For small k this multiplier is small and the algorithm can be considered polynomial.</p>
        <p>A similar effect on the computational complexity can be achieved by always expanding the concepts with the largest extent. This kind of concept expansion corresponds to the subgroup discovery procedure. Indeed, the larger the extent, the better the quality Q that can potentially be attained on its subsets. Thus, one can hope that in practice Algorithm 2 behaves as an algorithm with polynomial computational complexity for the subgroup discovery task.</p>
        <p>Before proceeding to the practical evaluation we should discuss how the best concepts should be registered, i.e., lines 8 and 12 of Algorithm 2.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Best concept registration</title>
        <p>The simplest concept registration procedure can just verify that the quality of the input concept is larger than the quality of the currently known best concept. If the new concept is better, then it substitutes the previous best concept. This procedure has a problem when the Δ-measure threshold θ is adjusted. Indeed, let θ = 1 and let the best concept found so far be cbest with Δ(cbest) = 1. If on the next step the stability threshold is adjusted, what should happen to cbest?</p>
        <p>
          If it is just preserved, then the result would not correspond to the final projection. This would invalidate any procedure that computes the statistical significance of the found subgroup [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Consequently, a special mechanism is needed that allows defining the best stable concept found so far for any Δ-threshold θ higher than the current threshold: any concept that is not dominated either by Δ-measure or by SD quality Q should be registered. Indeed, if for two concepts Δ(c1) ≤ Δ(c2) and Q(c1) ≤ Q(c2), i.e., c1 is dominated by c2, then the concept c1 cannot be the best SD-concept for any θ. By contrast, if Δ(c1) &lt; Δ(c2) and Q(c1) &gt; Q(c2), then for θ ≤ Δ(c1) the concept c1 is Δ-stable and, since Q(c1) &gt; Q(c2), c1 is better than c2 and can be reported as the best concept; however, if Δ(c1) &lt; θ ≤ Δ(c2), then c1 is no longer Δ-stable and thus cannot be reported as the best concept, while c2 still can be. Therefore, all undominated concepts should be preserved, and Algorithm 4 describes how the best concepts should be registered.
        </p>
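        <p>The bookkeeping of undominated concepts can be sketched with plain (Δ, Q) pairs. This is a toy Pareto-front version of the registration idea (not the red-black-tree implementation mentioned below):</p>

```python
def register(front, cand):
    """Keep only (delta, quality) pairs undominated in both coordinates;
    `front` is a sorted list of such pairs (toy sketch of Algorithm 4)."""
    d, q = cand
    # cand is dominated if some stored pair is at least as good in both
    if any(bd >= d and bq >= q for bd, bq in front):
        return front
    # drop stored pairs that the candidate now dominates
    kept = [(bd, bq) for bd, bq in front if bd > d or bq > q]
    return sorted(kept + [cand])

front = []
for c in [(3, 0.10), (1, 0.20), (2, 0.05), (2, 0.25)]:
    front = register(front, c)
print(front)  # [(2, 0.25), (3, 0.1)]
```

        <p>The surviving pairs trade higher Δ for lower quality, so the best reportable concept is well defined for every future threshold θ.</p>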
        <p>In line 2 of Algorithm 4 an element of the best concept storage is found. This element is such that best[i − 1] &lt; Δ(cc) ≤ best[i] (the set best of the best concepts is ordered w.r.t. Δ), i.e., best[i] dominates cc w.r.t. Δ. Thus, if Q(cc) &lt; Q(best[i]) the concept cc should not be registered (lines 3–4). Similarly, in lines 14–15, if the potential of cc is small then it is not promising.</p>
        <p>If the concept cc is not dominated by best[i], it is inserted into best in lines 5–8. Finally, the quality of cc can be so high that it dominates some concepts from best (the concepts best[j] for j &lt; i are already dominated w.r.t. the Δ-measure and thus should be compared by means of the quality function). The operations best.FindInsertPosition, best.Insert, and best.Remove are standard operations and can be efficiently implemented by means of red-black trees and other standard approaches.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>
        Let us now check how the proposed approach behaves on real data. The approach is evaluated on the chess dataset from the UCI machine learning repository [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>1 Function best.Register(cc)
2    i ← best.FindInsertPosition(Δ(cc));
3    if Q(best[i]) ≥ Q(cc) then
4      return;
5    if Δ(best[i]) == Δ(cc) then
6      best[i] ← cc;
7    else
8      best.Insert(i, cc);
9    i ← i − 1;
10   while Q(best[i]) &lt; Q(cc) do
11     best.Remove(i);
12     i ← i − 1;
13 Function best.IsPromising(cc)
14   i ← best.FindInsertPosition(ΔProj(cc)(cc));
15   return Q(best[i]) &lt; Q(cc);
Algorithm 4: Registration of concepts undominated w.r.t. the Δ-measure and the SD quality Q.</p>
      <p>
        The dataset contains 3196 objects, corresponding to chess positions, with the supervised information (‘the whites win’ and ‘the whites do not win’). The dataset contains a huge number of concepts and thus is a good candidate for a computational efficiency test. We run subgroup discovery with the classification quality function from [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] discussed in Section 1.3.
      </p>
      <sec id="sec-5-1">
        <title>Computational time and the dataset size</title>
        <p>In the first experiment the size of the dataset is varied from 200 objects to the whole dataset of 3196 objects. For every size a random sample from the original dataset is taken. The memory limit L is also varied from 100 to 100000. The result is shown in Figure 1a. Since it is interesting to see the difference over a large interval of possible dataset sizes, it is given in log-log scale. However, in order to judge the computational complexity of the approach, the raw values should be converted to relative values. In particular, the time is measured relative to the computational time needed for the smallest dataset with the same L. For example, for L = 1000 the computation time for the rightmost point (the whole dataset) is shown to be equal to 2.5, which means that it is 2^2.5 ≈ 5.7 times larger than the time needed for the smallest dataset with the same L. The algorithm can then be considered linear or better if the curve lies below the function y = x.</p>
        <p>As can be seen from the plot, for all values of L the computational behaviour of the approach is no worse than linear time complexity. We can also note that the slope of all curves is never larger than 1, which also indicates that the approach behaves linearly w.r.t. the dataset size.</p>
        <p>[Figure 1: relative computation time in log-log scale; (a) versus the dataset size (×200), with curves for L = 1000 and L = 10000; (b) versus the memory limit L (×100).]</p>
        <p>In the second experiment the whole dataset is taken and the memory limit is varied from 100 to 10^6. The result is shown in Figure 1b. As before, it is given in log-log scale. As we can see, the slope of this curve is also never larger than 1, i.e., the practical computational complexity of this approach is linear.</p>
        <p>In both experiments we can see that the slope is much smaller than 1 until a certain moment. This can be explained by the fact that the memory limit is too large and is not fully used. Indeed, when the datasets are small in the first experiment, the total number of concepts is also small and, thus, the available memory is too large. However, when the size of the dataset is increased, the total number of concepts also increases and the memory limit L becomes important. Similarly, in the second experiment, when the memory limit is large (L = 10^6), the memory limit becomes less important.</p>
        <p>Let us now briefly discuss how the quality of the found concepts depends on different memory limits on the whole dataset. Setting the memory limit to 100 allows finding the best concept with a Δ-measure of 106, an extent size of 1002, and a quality of 0.121. By contrast, for L = 10^6, the Δ of the found concept is 3, the extent size is 1326, and the quality is 0.157. We see that if we can wait, a better concept can be found; however, if the time is limited, a small L allows finding interesting subgroups much faster.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper a new approach to subgroup discovery was proposed. Although it is not possible to prove the polynomial complexity of the algorithm, in practice it shows polynomial behaviour for the subgroup discovery task, which is extremely important for processing large datasets.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Atzmueller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Subgroup discovery</article-title>
          .
          <source>Wiley Interdisc. Rev.: Data Mining and Knowledge Discovery</source>
          <volume>5</volume>
          (
          <issue>1</issue>
          ),
          <fpage>35</fpage>
          -
          <lpage>49</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Belfodil</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belfodil</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaytoue</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Anytime subgroup discovery in numerical domains with guarantees</article-title>
          .
          <source>In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          . pp.
          <fpage>500</fpage>
          -
          <lpage>516</lpage>
          . Springer (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Boley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldsmith</surname>
            ,
            <given-names>B.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghiringhelli</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vreeken</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Identifying consistent statements about numerical data with dispersion-corrected subgroup discovery</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          <volume>31</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1391</fpage>
          -
          <lpage>1418</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Boley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grosskreutz</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Non-redundant Subgroup Discovery Using a Closure System</article-title>
          . In:
          <string-name>
            <surname>Buntine</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grobelnik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mladenić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shawe-Taylor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (eds.)
          <source>Machine Learning and Knowledge Discovery in Databases</source>
          . pp.
          <fpage>179</fpage>
          -
          <lpage>194</lpage>
          . Springer, Berlin, Heidelberg (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bosc</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boulicaut</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raissi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaytoue</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Anytime discovery of a diverse set of patterns with monte carlo tree search</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          <volume>32</volume>
          (
          <issue>3</issue>
          )
          ,
          <fpage>604</fpage>
          -
          <lpage>650</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Buzmakov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Efficient Mining of Subsample-Stable Graph Patterns</article-title>
          .
          <source>In: 2017 IEEE International Conference on Data Mining (ICDM)</source>
          . pp.
          <fpage>757</fpage>
          -
          <lpage>762</lpage>
          . New Orleans, LA, USA (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Buzmakov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Scalable Estimates of Concept Stability</article-title>
          . In:
          <string-name>
            <surname>Sacarea</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glodeanu</surname>
            ,
            <given-names>C.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaytoue</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (eds.)
          <article-title>Formal Concept Analysis</article-title>
          ,
          <source>LNCS</source>
          , vol.
          <volume>8478</volume>
          , pp.
          <fpage>161</fpage>
          -
          <lpage>176</lpage>
          . Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Buzmakov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Fast Generation of Best Interval Patterns for Nonmonotonic Constraints</article-title>
          . In:
          <string-name>
            <surname>Appice</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodrigues</surname>
            ,
            <given-names>P.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos Costa</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gama</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jorge</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (eds.)
          <source>Machine Learning and Knowledge Discovery in Databases, LNCS</source>
          , vol.
          <volume>9285</volume>
          , pp.
          <fpage>157</fpage>
          -
          <lpage>172</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asuncion</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>UCI Machine Learning Repository</article-title>
          [http://archive.ics.uci.edu/ml]. University of California, Irvine, School of Information and Computer Sciences
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          :
          <article-title>Pattern Structures and Their Projections</article-title>
          . In:
          <string-name>
            <surname>Delugach</surname>
            ,
            <given-names>H.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stumme</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (eds.)
          <source>Conceptual Structures: Broadening the Base, LNCS</source>
          , vol.
          <volume>2120</volume>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>142</lpage>
          . Springer (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <source>Formal Concept Analysis: Mathematical Foundations</source>
          . Springer, 1st edn. (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          :
          <article-title>On stability of a formal concept</article-title>
          .
          <source>Annals of Mathematics and Artificial Intelligence</source>
          <volume>49</volume>
          (
          <issue>1-4</issue>
          ),
          <fpage>101</fpage>
          -
          <lpage>115</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lavrač</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavšek</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flach</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Todorovski</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Subgroup Discovery with CN2-SD</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>5</volume>
          ,
          <fpage>153</fpage>
          -
          <lpage>188</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaki</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <article-title>Sampling frequent and minimal boolean patterns: theory and application in classification</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          <volume>30</volume>
          (
          <issue>1</issue>
          ),
          <fpage>181</fpage>
          -
          <lpage>225</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Pellegrina</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riondato</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vandin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>SPuManTE: Significant Pattern Mining with Unconditional Testing</article-title>
          .
          <source>In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining (KDD '19)</source>
          .
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>