<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Stable Significant Subgroup Discovery?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jyoti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>y Buzm</string-name>
          <email>AVBuzmakov@hse.ru</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology Mandi</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>Perm</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>287</fpage>
      <lpage>292</lpage>
      <abstract>
        <p>Discovering subgroups with significant association with binary class labels has wide applications in drug discovery, market basket analysis, etc. The state-of-the-art technique, TopKWY, which mines the top-k significant subgroups does not scale to large datasets, especially, when the search space of concepts is very large. In this paper, we propose SD-SOFIA, an algorithm that mines stable significant subgroups rather than just significant subgroups. SD-Sofia is able to mine the same significant subgroup or subgroup with comparable quality to TopKWY by navigating only a reduced search space. We have verified the result in 19 real-world datasets. This insight gives us an opportunity to design efficient and scalable algorithm for finding statistically significant subgroup in large datasets. The quality of the pattern mined and the time taken by our algorithm is governed by the initial delta threshold value. From experiments, we show that when initial delta threshold value is set between 0.5 to 3 percent of the number of objects in the dataset, our algorithm generates a pattern with comparable quality as TopKWY.</p>
      </abstract>
      <kwd-group>
        <kwd>Stable significant pattern mining</kwd>
        <kwd>Subgroup discovery</kwd>
        <kwd>Delta stable</kwd>
        <kwd>SD-SOFIA</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Subgroup discovery has many applications in drug discovery, market basket
analysis, and technical domains [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Unlike concept mining in Formal Concept
Analysis (FCA) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], where we find closed patterns from unlabeled datasets, subgroup
discovery aims to mine patterns having a high association with a class label. A
special subarea of subgroup discovery, called significant pattern mining, mines
patterns which are considered significant by means of some statistical test [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>The total number of statistically significant patterns in a dataset can be
large. Thus, one may be interested in mining only the most significant patterns.</p>
      <p>
        TopKWY [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is the state-of-the-art algorithm that can directly mine the top-k
significant patterns without exploring all the significant patterns. It
automatically adjusts the significance threshold and the support threshold during concept
exploration and thereby attains better pruning of the search space. Even with
these optimizations, TopKWY can take significant amount of time for large
datasets due to the large search space of closed patterns. In practical
applications, where one is interested in mining just a few, most significant patterns,
time is the primary concern. For example, in subgroup discovery mining, only
one target concept of considerable quality is required; it is crucial to mine this
concept as fast as possible.
      </p>
      <p>
        In this paper, we propose SD-SOFIA, an algorithm that mines stable
significant subgroups rather than just significant subgroups. It is derived from
SOFIA [
        <xref ref-type="bibr" rid="ref3 ref5">5, 3</xref>
        ], an algorithm for mining patterns whose Δ-measure is greater than
a specified threshold. SD-SOFIA uses value obtained from an optimistic
estimator, in addition to the Δ-measure to expand the most promising concepts w.r.t.
a subgroup discovery task. This leads to faster discovery of the most stable
significant subgroup. We show that in standard datasets, SD-SOFIA can mine the
most stable significant pattern with more or less the same quality as the most
significant pattern found by TopKWY from the reduced search space of Δ-stable
patterns.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background and Related Work</title>
      <p>
        Formal context is defined as triplet K = (G, M, I), where G is a set of objects,
M is a set of attributes (or descriptions), and I ⊆ G × M is a binary relation
between G and M [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A pattern (or itemset) B is a subset of the attribute set
M and said to be closed if there exists no superset of B which is present in all
the objects containing B. Support of a pattern B is defined as the number of
objects containing B.
      </p>
      <p>
        As the number of closed patterns is extremely large, pattern mining
algorithms usually mine only patterns w.r.t. an interestingness measure. One of the
most promising interestingness measure is stability but it is hard to compute [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
So an estimate of stability; Δ-measure is proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Given a closed pattern
B ⊆ M , Δ-measure is defined as Δ(B) = minA⊃B(Supp(B) − Supp(A)). Here,
Supp(X) is a function that returns the support of the pattern X. This paper
introduces a new algorithm by exploring the search space of Δ-stable patterns,
i.e., closed patterns with Δ-measure greater than a specified threshold θ.
      </p>
      <p>
        In subgroup discovery, a binary class label c ∈ {0, 1} is associated with each
object. Following the authors of TopKWY [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we rely on Fisher exact test to
evaluate the association between a concept and the class labels. Moreover, this
test allows for efficient branch cutting during the search for the best concept
w.r.t. its association to class labels. This test gives the p-value in the interval
[
        <xref ref-type="bibr" rid="ref1">0, 1</xref>
        ]. The smaller the p-value is, the better the significance of the concept.
      </p>
      <p>
        Related Work
In significant pattern mining just fixing the significance threshold to some value
may lead to large number of false positives while testing multiple hypotheses.
Llinares et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] addressed this issue by first finding the appropriate
significance threshold and then discovering significant patterns w.r.t. this threshold.
Although, it works correctly for multiple hypothesis testing, the search space
of significant patterns can still be huge. Pellegrina et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] in their algorithm
TopKWY performs automatic adjustment of the upper bound on the support
of significant patterns. Thus, it prunes the insignificant (or untestable) patterns
faster. However, it has been observed that in large datasets, this upper bound
gets adjusted to a very small value. Thereby, TopKWY takes huge amount of
time to mine the most significant pattern. In this paper, we present an algorithm
SD-SOFIA, which is based on SOFIA algorithm [
        <xref ref-type="bibr" rid="ref3 ref5">5, 3</xref>
        ] to mine the most Δ-stable
significant pattern. As our algorithm considers only Δ-stable concepts, it can
potentially give rise to results having lower quality compared to TopKWY.
However, from experiments on standard datasets we observe that SD-SOFIA gives
patterns with comparable quality from the reduced search space of Δ-stable
patterns.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Proposed Algorithm</title>
      <p>Let us first fix some order on attributes M . For the sake of simplicity in this
paper, by projection i we mean the dataset limited to having only first i attributes.
Let Mi be the first i attributes from M .</p>
      <p>
        The recent algorithm SOFIA [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is based on Δ-stable concepts in projection i
and the new attribute i + 1 computes Δ-stable context in the projection i + 1.
It should be noticed, that this strategy is limited for subgroup discovery, since
it finds preimages of all concepts from projection i to projection i + 1
simultaneously. However, the most commonly used strategy in subgroup discovery is
expansion of the most promising concept [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This allows for earlier finding of
concepts with high quality Q improving the efficiency of branch cutting.
      </p>
      <p>In algorithm SOFIA, finding preimages of a concept does not depend on other
concepts and, thus, concepts that are stored in P can correspond to different
projections. Thus, it is not necessary to move all concepts from projection i to
projection i + 1, i.e., only the most promising concept can be moved to the next
(w.r.t. this concept) projection. This procedure is shown in Algorithm 1. All
concepts are stored in the queue. The queue can contain concepts from different
projections, thus, a concept is denoted as (A, B | i), where A and B are the
extent and the intent of the concept correspondingly, and i is the projection it
is computed in. The corresponding elements of a concept c = (A, B | i) can be
extracted by means of functions Ext(c), Int(c), and Proj(c) for the extent, the
intent, and the projection number of c correspondingly.</p>
      <p>In line 2, Algorithm SD-SOFIA initializes the queue with the only available
concept in projection 0. This concept is (G, ∅). Indeed, since M0 contains no</p>
      <p>Algorithm 1: Algorithm SD-SOFIA identifying the Δ-measure
threshold θ and the corresponding best concept w.r.t. a quality function Q.
1 Function FindBestConcept()
2 queue.Push ((G, ⊥ | 0));
3 while not queue.isEmpty() do
4 c ← queue.PopTheMostPromising ();
5 {ci} ← Preimages(c);
6 foreach cc ∈ {ci} do
7 if Proj(cc) = |M | then
8 best.Register(cc);
9 next;
10 if not ΔProj(cc)(cc) &lt; θ then
11 next;
12 if not best.IsPromising(cc) then
13 next;
14 queue.Push(cc);
/* Projection number 0 */
attribute, the only available intent is ∅. The algorithm iterates while queue is
not empty. The concepts are extracted one by one (line 4) w.r.t. their potential,
i.e., the value of the optimistic estimate Q of the quality function Q. Then in line
5 the preimages of the most potential concept c = (A, B | i) are computed. Since
projection ψi+1 preserves one more attribute than projection ψi, there are only
two possible preimages: c1 = (A, B↓↑ | i + 1) and c2 = ((B ∪ {i})↓, (B ∪ {i})↓↑ |
i + 1). The preimages c1 and c2 can coincide.</p>
      <p>Then every preimage cc of the concept c is processed in lines 7–14. First
in lines 7–9 it is verified that cc is already in the last projection ψ|M|. Only
in this case the final Δ-measure value for this concept is known. Thus, only in
this moment it is possible to decide if this concept can be reported as the best
concept. Then in lines 10-13 the concept cc is checked if it can produce stable
concepts and if it can produce a better concept than the already found one.
Here, in line 10 the Δ-measure is indexed with the projection number, since the
Δ-measure even for the same concept in different projection can change.</p>
      <p>The correctness of the algorithm follows from the fact that different concepts
can be stored in different projections but every concept is passing exactly the
same strategy as in algorithm SOFIA.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Experimental Evaluation</title>
      <p>
        We compare the performance of SD-SOFIA with TopKWY [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] in terms of pattern
quality and total execution time on standard datasets available from FIMI4,
      </p>
      <sec id="sec-4-1">
        <title>4 http://fimi.ua.ac.be</title>
        <p>chess
connect</p>
        <p>ijcnn1
Datasets
covtype
ushroom
m
susy
accidents</p>
        <p>
          UCI5, SPMF6, and libSVM7. We did experiment on all 19 datasets used in
TopKWY [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] with the number of attributes varying from 16 to 330285 and the
number of objects varying from 1243 to 5000000. Same as TopKWY, SD-SOFIA
is implemented in C++. We performed the experiments on a machine having
10-cores, 2.30GHz Intel(R) Xeon CPU, 128GB RAM; the OS is Ubuntu 16.04.6
LTS.
        </p>
        <p>To mine the most significant pattern using TopKWY8, we set the value of K
to one while retaining default values for the other parameters. For SD-SOFIA,
we vary Δ-stability threshold, θ, from 0.5% to 3%. The performance of
SDSOFIA depends on the Δ-stability threshold θ. Below 0.5%, SD-SOFIA ends
up exploring the entire space of closed patterns and takes too long to complete.
Above 3%, the quality of the resulting pattern in SD-SOFIA is quite poor.</p>
        <p>Fig. 1 compares the execution time of TopKWY and SD-SOFIA (without
permutation testing) on different datasets. x-axis shows the datasets and
yaxis shows the execution time of the algorithms using logarithmic scale. As we
have many datasets, we show the results only for 9 datasets. From Fig. 1, the
execution time for SD-SOFIA decreases with an increase in θ value. Similar
results are obtained for the remaining datasets. As θ value increases, the search</p>
      </sec>
      <sec id="sec-4-2">
        <title>5 https://archive.ics.uci.edu/ml/index.php</title>
        <p>6 http://www.philippe-fournier-viger.com/spmf
7 https://www.csie.ntu.edu.tw/ecjlin/libsvmtools/datasets
8 https://github.com/VandinLab/TopKWY</p>
        <p>Jyoti et al.
space of stable patterns gets shrunk. Hence, the execution time goes down; but
the quality of the resulting pattern obtained from SD-SOFIA also decreases.</p>
        <p>When the threshold is 0.5, for 13 out of 19 datasets, SD-SOFIA exactly
matched the quality obtained by TopKWY, i.e. the most stable significant
pattern was the same as the most significant pattern. Basically, we have a trade-off
between time and quality. As the Δ threshold increases, the running time
reduces, but the quality of the found pattern by SD-SOFIA also decreases and
vice-versa. By a similar procedure to TopKWY, we also verified that, if the
results of SD-SOFIA and TopKWY are different, the result found by SD-SOFIA is
always statistically significant. With naively implemented permutation testing,
SD-SOFIA’s (for θ = 3% of |G|) total execution time is comparable to TopKWY
in 13 datasets, an order of magnitude better than TopKWY in 2 datasets and
slower by an order of magnitude in 4 datasets.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>We have shown that stable patterns are good candidates for statistically
significant patterns. This insight is useful to scale to large datasets by navigating
only a reduced search space. From the results, it seems that 0.5-3 % of |G| is
a good choice for the threshold θ. Our future work is to improve efficiency and
verify the statistical power of SD-SOFIA in noisy datasets. Another interesting
extension is to mine the top-k stable significant patterns. We would also adopt
efficient permutation testing from TopKWY in SD-SOFIA.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Atzmueller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Subgroup discovery</article-title>
          .
          <source>WIREs: Data Mining and Knowledge Discovery</source>
          <volume>5</volume>
          (
          <issue>1</issue>
          ),
          <fpage>35</fpage>
          -
          <lpage>49</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Boley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grosskreutz</surname>
          </string-name>
          , H.:
          <article-title>Non-redundant Subgroup Discovery Using a Closure System</article-title>
          .
          <source>In: Machine Learning and Knowledge Discovery in Databases</source>
          . pp.
          <fpage>179</fpage>
          -
          <lpage>194</lpage>
          . Springer, Berlin, Heidelberg (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Buzmakov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Efficient Mining of Subsample-Stable Graph Patterns</article-title>
          .
          <source>In: 2017 IEEE International Conference on Data Mining (ICDM)</source>
          . pp.
          <fpage>757</fpage>
          -
          <lpage>762</lpage>
          . New Orlean, LA, USA (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Buzmakov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Scalable estimates of concept stability</article-title>
          . In: Glodeanu,
          <string-name>
            <given-names>C.V.</given-names>
            ,
            <surname>Kaytoue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Sacarea</surname>
          </string-name>
          , C. (eds.)
          <article-title>Formal Concept Analysis</article-title>
          . pp.
          <fpage>157</fpage>
          -
          <lpage>172</lpage>
          . Springer International Publishing,
          <string-name>
            <surname>Cham</surname>
          </string-name>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Buzmakov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Fast Generation of Best Interval Patterns for Nonmonotonic Constraints</article-title>
          .
          <source>In: Machine Learning and Knowledge Discovery in Databases, LNCS</source>
          , vol.
          <volume>9285</volume>
          , pp.
          <fpage>157</fpage>
          -
          <lpage>172</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
          </string-name>
          , R.:
          <source>Formal Concept Analysis: Mathematical Foundations</source>
          . Springer, 1st edn. (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Llinares-Lo´pez,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Sugiyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Papaxanthos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Borgwardt</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          :
          <article-title>Fast and memory-efficient significant pattern mining via permutation testing</article-title>
          .
          <source>In: 21st ACM SIGKDD</source>
          . p.
          <fpage>725</fpage>
          -
          <lpage>734</lpage>
          . New York, NY, USA (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Pellegrina</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vandin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Efficient mining of the most significant patterns with permutation testing</article-title>
          .
          <source>In: 24th ACM SIGKDD</source>
          . p.
          <fpage>2070</fpage>
          -
          <lpage>2079</lpage>
          . ACM, New York, NY, USA (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>