<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Complete and incomplete approaches for graph mining?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amina Kemmar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yahia Lebbah</string-name>
          <email>ylebbah@yahoo.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammed Ouali</string-name>
          <email>mohammed.ouali@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samir Loudni</string-name>
          <email>samir.loudni@unicaen</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Caen - Campus II, Department of Computer Science</institution>
          ,
          <country>France fr</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Oran, Es-Senia, Lab. LITIO</institution>
          ,
          <addr-line>B.P. 1524 EL-M'Naouar, Oran</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <fpage>312</fpage>
      <lpage>317</lpage>
      <abstract>
        <p>In this paper, we revisit approaches for graph mining where a set of simple encodings is proposed. Complete approaches are those using an encoding allowing to get all the frequent subgraphs. Whereas incomplete approaches do not guarantee to nd all the frequent subgraphs. Our objective is also to highlight the critical points in the process of extracting the frequent subgraphs with complete and incomplete approaches. Current canonical encodings have a complexity which is of exponential nature, motivating this paper to propose a relaxation of canonicity of the encoding leading to complete and incomplete encodings with a linear complexity. These techniques are implemented within our graph miner GGM (Generic Graph Miner) and then evaluated on a set of graph databases, showing the behavior of both complete and incomplete approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>Graph mining</kwd>
        <kwd>frequent subgraph</kwd>
        <kwd>pattern discovery</kwd>
        <kwd>graph isomorphism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Graph-mining represents the set of techniques used to extract interesting and
previously unknown information about the relational aspect from data sets that
are represented with graphs.</p>
      <p>We revisit some approaches for graph mining where a set of simple encodings
is proposed. Complete approaches are those using an encoding enabling to get
all the frequent subgraphs. Whereas incomplete approaches do not guarantee
to nd all the frequent subgraphs. Our objective is also to highlight the
critical points in the process of extracting the frequent subgraphs. The introduced
techniques are implemented within GGM (generic graph miner). We provide an
experimentation with GGM showing the behavior of complete and incomplete
approaches. It is not proven if the canonical encoding of graphs is in the class of
NP-complete problems, nor in polynomial class. This is also veri ed in practice,
? This work is supported by TASSILI research program 11MDU839 (France, Algeria).
since that all the current canonical encodings have complexities which are of
exponential nature. This motivates deeply our work on proposing a relaxation
of the canonicity of the encoding, leading us to what we quali ed in this paper
complete and incomplete encodings with low polynomial complexities.</p>
      <p>The following section 2 introduces preliminaries on graph mining and the
current approaches to solve frequent subgraph discovery problem. Section 3 explains
our graph mining algorithm GGM. Experimental results of GGM are given in
section 4. Section 5 concludes the paper and addresses some perspectives.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Frequent subgraph discovery problem</title>
      <p>An undirected graph G = (V; E) is made of the set of vertices V and the set of
edges E V V . Each edge (v1; v2) is an unordered pair of vertices. We will
assume that the graph is labeled with vertex labels LV and edge labels LE; the
same label can be assigned to many vertices (or edges) in the same graph. The
size of a graph G = (V; E) is de ned to be equal to jEj.</p>
      <p>De nition 1 (Frequent Subgraph discovery). Given a database G which
contains a collection of graphs. The frequency of a graph G in G is de ned by
f req(G; G) = #fG0 2 GjG G0g. The support of a graph is de ned by
support(G; G) = f req(G; G)=jGj:</p>
      <p>The frequent subgraph discovery problem consists to nd all connected
undirected graphs F that are subgraphs of at least minsupjGj graphs of G:
F = fG 2 Gjsupport(G; G)
minsupg;
for some prede ned minimum support threshold minsup that is speci ed by
the user.</p>
      <p>
        Generally, we can distinguish between the methods of discovering frequent
subgraphs according to the way the three following problems are handled:
Candidates generation problem This is the rst step in the frequent
subgraph discovery process which depends on the search strategy. It can be
done with breadth rst or depth rst strategies. With breadth rst strategy,
all k-candidates (i.e., having k edges) are generated together, then (k +
1)candidates and so on; making the memory consumption huge [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. But with
a depth approach, the k-candidates are iteratively generated, one by one.
Subgraph encoding problem When some new candidate is produced, we
should verify that it has been already generated. This can be resolved by
testing if this new candidate is isomorphic to one of the already generated
subgraphs. The canonical DFS code [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is usually used to encode the
generated frequent subgraphs. By this way, verifying that the new candidate is
isomorphic to one of the already generated candidates is equivalent to testing
if its encoding is equal to the encoding of some already generated candidate.
Frequency computation problem If some new candidate is declared to be
not isomorphic to any of the already produced candidates, we should
compute its frequency. It could be done by nding all the graphs of the database
which contain this new candidate.
      </p>
      <p>In the following section, we present a new algorithm GGM - Generic Graph
Miner - for nding connected frequent subgraphs in a graphs database. We
propose also some simple encodings to handle e ciently the frequency counting
problem.
3</p>
    </sec>
    <sec id="sec-3">
      <title>GGM, a generic graph miner</title>
      <p>GGM nds frequent subgraphs, parameterized with some encoding strategies
detailed in section 3.1. It is generic, because we aim to make the key steps of
GGM easily parameterized.</p>
      <p>Algorithm 1 GGM(G, fmin)
Require: G represents the graph dataset and fmin the minimum frequency threshold.
Ensure: F is the set of frequent subgraphs in G.
1: F
2: E a;ll frequent edge labels in G
3: N all frequent node labels in G
4: P Generate-Paths(N , E, G, fmin)
5: T Generate-Trees(P, E, G, fmin)
6: C Generate-Cyclic-Graphs(T , E, G, fmin)
7: F
8: RETUPR[NTF[ C</p>
      <p>The general structure of the algorithm is illustrated in algorithm 1. The
algorithm initializes the frequent subgraphs with all frequent edges and nodes
within the graph database G. Then, the algorithm proceeds with three separated
steps:
1. enumerating frequent paths from the frequent nodes,
2. generating the frequent trees from the frequent paths by keeping the same
extremities of each initial path,
3. extending the frequent paths and trees by adding an edge between two
existing nodes to obtain cyclic graphs.</p>
      <p>
        This approach is inspired from GASTON [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] in which these three steps are
repeated for each discovered subgraph. In other words, GASTON loops on the
above three steps, whereas in our approach, they are executed one time only.
3.1
      </p>
      <p>Graph encoding
The canonical labeling is used to check whether a particular candidate subgraph
has already been generated or not. However, developing algorithms that can
e ciently compute the canonical labeling is critical to ensure that the mining
algorithm can scale to very large graph datasets. There exists di erent ways to
assign a code to a given graph, but it must uniquely identify the graph such
that if two graphs are isomorphic, they will be assigned the same code. Such
encoding is called a canonical encoding. It is not proven if the canonical encoding
of graphs is in the class of NP-complete problems, nor in polynomial class. This
is also veri ed in practice, since that all the current canonical encodings have
complexities which are of exponential nature.</p>
      <p>The idea of our encoding is to use a non-canonical encoding, resulting in two
kinds of encodings : complete and incomplete.</p>
      <p>De nition 2 (Complete and incomplete encodings). Let f be an encoding
function. For any two distinct non-isomorphic graphs G1 and G2, f is complete
if f (G1) 6= f (G2). Otherwise, f is said to be incomplete.
where ei &lt;l ei+1, the relation &lt;l de nes a lexicographic order between edges
(e.g. (2; 3; X; s; Y ) &lt; (2; 3; Z; s; Y )). The code associated to the graph (a) of
Figure 1 is: (1,4,Z,r,Y)(2,3,X,s,X)(2,3,Z,q,X)(2,4,X,s,Y)(2,4,Z,t,Y)(3,4,X,t,Y).
Enumerating the edges is done with O(m), where m is the number of edges,
but sorting lexicographically the edges requires O(m log(m)) which is the
worst case complexity of sorting algorithms. Thus, in nal, the complexity
of this encoding is O(m log(m)). This encoding is not complete because we
can nd examples of non-isomorphic graphs having the same code.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <p>We have done a comparison between our algorithm and the Gaston tool.
Table 3 (resp. Table 4) shows the results of our algorithm with the rst database
(resp. second database). The minimum frequency was set to 1 for the rst
results. So, this table presents the runtimes to generate all the subgraphs of each
database. While for the second case, we choose di erent MinFreq expressed
also by MinSup (i.e. minimum support). For instance, for the PTE database
which contains 340 graphs, the number of graphs that are subgraphs of at least
1 The Predictive Toxicology dataset (PTE) can
http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/PTE.
be
downloaded
from
60% = 230440 graphs of PTE, is given by #T otal = #f req:paths + #f req:trees +
#f req:cyclic graphs. From this frequency, we see that there is no frequent cyclic
graphs (i.e. #freq. cyclic graphs = 0).</p>
      <p>Concerning the results on both databases, the complete encoding is usually
less performant than the incomplete one. This is explained by the fact that the
complete encoding handles larger set of candidates than the incomplete one. For
the incomplete encoding, the number of frequent subgraphs discovered by GGM
is not too far from that of Gaston. We notice also that from the frequency of
170 graphs, the result is the same.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we have presented algorithm GGM for the frequent subgraph
discovery problem in a graph datasets. We pointed out the key points in the
graph mining process. We combined several strategies inspired from existing
algorithms to implement GGM. The two important points in the process of
discovery are the generation of new candidates and the frequency counting. Our
experimentations show the e ectiveness of the incomplete approach compared
to the complete one. It shows the importance of handling a reasonable amount
of candidates. The main perspective is to improve our incomplete encoding. We
have to experiment our incomplete approach on huge graph databeses such as
those coming from chemistery. Actually, we have implemented the whole mining
algorithm in GGM and confront its performance to Gaston.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Bettina</given-names>
            <surname>Berendt</surname>
          </string-name>
          , Andreas Hotho, and
          <string-name>
            <given-names>Stum</given-names>
            <surname>Gerd</surname>
          </string-name>
          .
          <article-title>Towards semantic web mining</article-title>
          .
          <source>In Proceedings of the First International Semantic Web Conference on The Semantic Web, ISWC '02</source>
          , pages
          <fpage>264</fpage>
          {
          <fpage>278</fpage>
          , London, UK, UK,
          <year>2002</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Mukund</given-names>
            <surname>Deshpande</surname>
          </string-name>
          , Michihiro Kuramochi, Nikil Wale, and
          <string-name>
            <given-names>George</given-names>
            <surname>Karypis</surname>
          </string-name>
          .
          <article-title>Frequent substructure-based approaches for classifying chemical compounds</article-title>
          .
          <source>IEEE Trans. on Knowl. and Data Eng</source>
          .,
          <volume>17</volume>
          :
          <fpage>1036</fpage>
          {
          <fpage>1050</fpage>
          ,
          <year>August 2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Akihiro</given-names>
            <surname>Inokuchi</surname>
          </string-name>
          , Takashi Washio, and
          <string-name>
            <given-names>Hiroshi</given-names>
            <surname>Motoda</surname>
          </string-name>
          .
          <article-title>An apriori-based algorithm for mining frequent substructures from graph data</article-title>
          .
          <source>In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery</source>
          ,
          <source>PKDD '00</source>
          , pages
          <fpage>13</fpage>
          {
          <fpage>23</fpage>
          , London, UK,
          <year>2000</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Michihiro</given-names>
            <surname>Kuramochi</surname>
          </string-name>
          and
          <string-name>
            <given-names>George</given-names>
            <surname>Karypis</surname>
          </string-name>
          .
          <article-title>Frequent subgraph discovery</article-title>
          .
          <source>In Proceedings of the 2001 IEEE International Conference on Data Mining, ICDM '01</source>
          , pages
          <fpage>313</fpage>
          {
          <fpage>320</fpage>
          , Washington, DC, USA,
          <year>2001</year>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Siegfried</given-names>
            <surname>Nijssen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Joost N.</given-names>
            <surname>Kok</surname>
          </string-name>
          .
          <article-title>The gaston tool for frequent subgraph mining</article-title>
          .
          <source>Electron. Notes Theor. Comput. Sci.</source>
          ,
          <volume>127</volume>
          :
          <fpage>77</fpage>
          {
          <fpage>87</fpage>
          ,
          <string-name>
            <surname>March</surname>
          </string-name>
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Xifeng</given-names>
            <surname>Yan</surname>
          </string-name>
          and Jiawei Han.
          <article-title>gspan: Graph-based substructure pattern mining</article-title>
          .
          <source>Order A Journal On The Theory Of Ordered Sets And Its Applications</source>
          ,
          <volume>02</volume>
          :
          <fpage>721</fpage>
          {
          <fpage>724</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ken</surname>
          </string-name>
          <article-title>'ichi Yoshida and Hiroshi Motoda. Clip: concept learning from inference patterns</article-title>
          .
          <source>Artif</source>
          . Intell.,
          <volume>75</volume>
          :
          <fpage>63</fpage>
          {
          <fpage>92</fpage>
          , May
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>