Complete and Incomplete Approaches for Graph Mining. CEUR-WS Vol-867, Paper 35. https://ceur-ws.org/Vol-867/Paper35.pdf
 Complete and incomplete approaches for graph mining*

 Amina Kemmar1 , Yahia Lebbah1 , Mohammed Ouali1 , and Samir Loudni2

     1 University of Oran, Es-Senia, Lab. LITIO, B.P. 1524 EL-M'Naouar, Oran, Algeria
     2 University of Caen - Campus II, Department of Computer Science, France
     {kemmami,ylebbah}@yahoo.fr, mohammed.ouali@gmail.com, samir.loudni@unicaen.fr



        Abstract. In this paper, we revisit approaches for graph mining and
        propose a set of simple encodings. Complete approaches use an encoding
        that allows retrieving all the frequent subgraphs, whereas incomplete
        approaches do not guarantee that all frequent subgraphs are found. Our
        objective is also to highlight the critical points in the process of
        extracting frequent subgraphs with complete and incomplete approaches.
        Current canonical encodings have exponential complexity, which motivates
        us to propose a relaxation of the canonicity of the encoding, leading
        to complete and incomplete encodings with linear complexity. These
        techniques are implemented within our graph miner GGM (Generic Graph
        Miner) and evaluated on a set of graph databases, showing the behavior
        of both complete and incomplete approaches.

        Keywords: Graph mining, frequent subgraph, pattern discovery, graph
        isomorphism.


1     Introduction

Graph mining denotes the set of techniques used to extract interesting and
previously unknown relational information from data sets that are represented
as graphs.
    We revisit some approaches for graph mining in which a set of simple encodings
is proposed. Complete approaches use an encoding that allows retrieving all the
frequent subgraphs, whereas incomplete approaches do not guarantee to find all of
them. Our objective is also to highlight the critical points in the process of
extracting frequent subgraphs. The introduced techniques are implemented within
GGM (Generic Graph Miner), and we report experiments with GGM showing the
behavior of complete and incomplete approaches. It is not known whether the
canonical encoding of graphs is NP-complete or polynomial. This is also
confirmed in practice,
*
    This work is supported by TASSILI research program 11MDU839 (France, Algeria).





since all current canonical encodings have exponential worst-case complexity.
This strongly motivates our work on relaxing the canonicity of the encoding,
leading to what we qualify in this paper as complete and incomplete encodings
with low polynomial complexities.
    Section 2 introduces preliminaries on graph mining and the current approaches
to solve the frequent subgraph discovery problem. Section 3 explains our graph
mining algorithm GGM. Experimental results of GGM are given in Section 4.
Section 5 concludes the paper and discusses some perspectives.


2   Frequent subgraph discovery problem
An undirected graph G = (V, E) consists of a set of vertices V and a set of
edges E ⊆ V × V. Each edge (v1, v2) is an unordered pair of vertices. We assume
that the graph is labeled, with vertex labels LV and edge labels LE; the same
label can be assigned to several vertices (or edges) of the same graph. The
size of a graph G = (V, E) is defined to be |E|.
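For illustration, the labeled-graph definition above can be written as a small Python structure (our own toy representation, not taken from the paper; the class name and field layout are our choice):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Graph:
    """An undirected labeled graph: vertex labels (from LV) and
    edge labels (from LE), edges keyed by unordered vertex pairs."""
    vlabel: dict   # vertex -> label in LV
    elabel: dict   # frozenset({u, v}) -> label in LE

    def size(self):
        # The size of G = (V, E) is |E|, as defined in the text.
        return len(self.elabel)

# A labeled triangle; the label X is shared by two vertices.
g = Graph(
    vlabel={1: "X", 2: "Y", 3: "X"},
    elabel={frozenset({1, 2}): "s",
            frozenset({2, 3}): "t",
            frozenset({1, 3}): "s"},
)
print(g.size())  # 3
```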

Definition 1 (Frequent subgraph discovery). Let 𝒢 be a database containing
a collection of graphs. The frequency of a graph G in 𝒢 is defined by
freq(G, 𝒢) = #{G′ ∈ 𝒢 | G ⊆ G′}. The support of a graph is defined by

                        support(G, 𝒢) = freq(G, 𝒢)/|𝒢|.

    The frequent subgraph discovery problem consists in finding all connected
undirected graphs F that are subgraphs of at least minsup·|𝒢| graphs of 𝒢:

                    F = {G ∈ 𝒢 | support(G, 𝒢) ≥ minsup},

    for some minimum support threshold minsup specified by the user.
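As a minimal illustration of Definition 1, frequency and support can be computed with a brute-force label-preserving subgraph test. This is our own toy sketch (a graph is a pair of dicts, vertex → label and unordered edge → label); the exhaustive mapping search is exponential and only meant for tiny graphs:

```python
from itertools import permutations

def contains(big, small):
    """Brute-force test of G ⊆ G': does `big` contain a subgraph
    isomorphic to `small`, preserving vertex and edge labels?"""
    bvl, bel = big      # vertex labels, edge labels of G'
    svl, sel = small    # vertex labels, edge labels of G
    sv = list(svl)
    for image in permutations(bvl, len(sv)):
        m = dict(zip(sv, image))            # injective vertex mapping
        if any(svl[v] != bvl[m[v]] for v in sv):
            continue                        # vertex labels must match
        if all(bel.get(frozenset(m[x] for x in e)) == lab
               for e, lab in sel.items()):
            return True                     # all edges mapped, labels match
    return False

def freq(G, DB):
    """freq(G, 𝒢) = #{G' ∈ 𝒢 | G ⊆ G'}."""
    return sum(contains(Gp, G) for Gp in DB)

def support(G, DB):
    """support(G, 𝒢) = freq(G, 𝒢) / |𝒢|."""
    return freq(G, DB) / len(DB)

# Database of two graphs; the pattern is a single X-s-Y edge.
g1 = ({1: "X", 2: "Y"}, {frozenset({1, 2}): "s"})
g2 = ({1: "X", 2: "Y"}, {frozenset({1, 2}): "t"})
pattern = g1
print(freq(pattern, [g1, g2]), support(pattern, [g1, g2]))  # 1 0.5
```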

   Generally, frequent subgraph discovery methods can be distinguished according
to the way the three following problems are handled:

Candidate generation problem This is the first step of the frequent sub-
   graph discovery process, and it depends on the search strategy, which can be
   breadth-first or depth-first. With a breadth-first strategy, all k-candidates
   (i.e., candidates having k edges) are generated together, then the (k + 1)-
   candidates, and so on, making memory consumption huge [3, 4]. With a
   depth-first strategy, the k-candidates are generated iteratively, one by one.
Subgraph encoding problem When a new candidate is produced, we must
   check whether it has already been generated. This can be done by testing
   whether the new candidate is isomorphic to one of the already generated
   subgraphs. The canonical DFS code [6] is usually used to encode the gen-
   erated frequent subgraphs. In this way, verifying that the new candidate is
   isomorphic to an already generated candidate amounts to testing whether
   its encoding equals the encoding of some already generated candidate.





Frequency computation problem If a new candidate is not isomorphic to
   any of the already produced candidates, we must compute its frequency, by
   finding all the graphs of the database that contain this new candidate.

   In the following section, we present a new algorithm, GGM (Generic Graph
Miner), for finding connected frequent subgraphs in a graph database. We also
propose some simple encodings to handle the frequency counting problem
efficiently.
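The encoding-based duplicate check described above (subgraph encoding problem) can be sketched as follows. The `encode` parameter stands for any encoding function; the toy code used in the example (the sorted multiset of edge labels) is our own simplification, not one of the paper's encodings:

```python
def dedup(candidates, encode):
    """Keep one representative candidate per code.  With a complete
    encoding, candidates merged here are guaranteed isomorphic; with an
    incomplete one, distinct subgraphs may (wrongly) be merged."""
    seen, kept = set(), []
    for g in candidates:
        c = encode(g)
        if c not in seen:       # unseen code: a genuinely new candidate
            seen.add(c)
            kept.append(g)
    return kept

# Candidates as edge-label dicts; the first two are the "same" labeled
# edge up to renaming of vertex identifiers.
cands = [{frozenset({1, 2}): "s"},
         {frozenset({7, 9}): "s"},
         {frozenset({1, 2}): "t"}]
code = lambda g: tuple(sorted(g.values()))   # toy (incomplete) encoding
print(len(dedup(cands, code)))  # 2
```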


3     GGM, a generic graph miner
GGM finds frequent subgraphs, parameterized with some encoding strategies
detailed in section 3.1. It is generic, because we aim to make the key steps of
GGM easily parameterized.
Algorithm 1 GGM(G, fmin )
Require: G represents the graph dataset and fmin the minimum frequency threshold.
Ensure: F is the set of frequent subgraphs in G.
1: F ← ∅
2: E ← all frequent edge labels in G
3: N ← all frequent node labels in G
4: P ← Generate-Paths(N , E, G, fmin )
5: T ← Generate-Trees(P, E, G, fmin )
6: C ← Generate-Cyclic-Graphs(T , E, G, fmin )
7: F ← P ∪ T ∪ C
8: RETURN F

   The general structure is given in Algorithm 1. The algorithm initializes the
frequent subgraphs with all frequent edges and nodes of the graph database G.
Then, it proceeds in three separate steps:
 1. enumerating frequent paths from the frequent nodes,
 2. generating the frequent trees from the frequent paths by keeping the same
    extremities of each initial path,
 3. extending the frequent paths and trees by adding an edge between two ex-
    isting nodes to obtain cyclic graphs.
   This approach is inspired by GASTON [5], in which these three steps are
repeated for each discovered subgraph. In other words, GASTON loops over the
three steps above, whereas in our approach they are executed only once.
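The control flow of Algorithm 1 can be sketched as below. This is a structural skeleton only: the three phase functions are passed as parameters (mirroring GGM's genericity), and the usage example instantiates them with trivial hypothetical stubs, not the paper's actual path/tree/cyclic-graph generators:

```python
from collections import Counter

def frequent_edge_labels(DB, fmin):
    """Edge labels occurring in at least fmin graphs (Algorithm 1, line 2)."""
    cnt = Counter()
    for _, elab in DB:
        cnt.update(set(elab.values()))   # count each label once per graph
    return {l for l, c in cnt.items() if c >= fmin}

def frequent_node_labels(DB, fmin):
    """Node labels occurring in at least fmin graphs (line 3)."""
    cnt = Counter()
    for vlab, _ in DB:
        cnt.update(set(vlab.values()))
    return {l for l, c in cnt.items() if c >= fmin}

def ggm(DB, fmin, generate_paths, generate_trees, generate_cyclic):
    """Skeleton of GGM(G, fmin): three sequential phases, run once each."""
    E = frequent_edge_labels(DB, fmin)
    N = frequent_node_labels(DB, fmin)
    P = generate_paths(N, E, DB, fmin)     # line 4
    T = generate_trees(P, E, DB, fmin)     # line 5
    C = generate_cyclic(T, E, DB, fmin)    # line 6
    return P | T | C                       # line 7: F = P ∪ T ∪ C

# Hypothetical stub phases, just to exercise the skeleton: "paths" are
# the frequent node labels, and no trees or cyclic graphs are produced.
paths = lambda N, E, DB, f: set(N)
empty = lambda X, E, DB, f: set()
db = [({1: "X", 2: "Y"}, {frozenset({1, 2}): "s"}),
      ({1: "X"}, {})]
print(ggm(db, 2, paths, empty, empty))  # {'X'}
```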

3.1   Graph encoding
The canonical labeling is used to check whether a particular candidate subgraph
has already been generated. Developing algorithms that compute the canonical
labeling efficiently is critical for the mining algorithm to scale to very large
graph datasets. There exist different ways to assign a code to a given graph,
but the code must uniquely identify the graph such





that if two graphs are isomorphic, they are assigned the same code. Such an
encoding is called canonical. As noted in the introduction, it is not known
whether the canonical encoding of graphs is NP-complete or polynomial; in
practice, all current canonical encodings have exponential worst-case
complexity.
    The idea of our encoding is to use a non-canonical encoding, resulting in two
kinds of encodings: complete and incomplete.
Definition 2 (Complete and incomplete encodings). Let f be an encoding
function. f is complete if f(G1) ≠ f(G2) for any two non-isomorphic graphs G1
and G2. Otherwise, f is said to be incomplete.
DFS based complete encoding This encoding is a relaxation of the one de-
  fined in [6]. The code is obtained from a single depth-first walk (and not from
  the minimum over all walks, as in [6]). This encoding is clearly complete, but
  the same graph can be generated several times, as illustrated in Figure 1,
  which shows that two isomorphic graphs can have different codes. For the
  graph (a) in Figure 1 there exist several DFS codes; two of them, based on
  the DFS trees of Figure 1(b)-(c), are listed in Table 1.
    edge      0                1               2               3               4                5
    Fig 1.(b) (1, 2, X, s, Y ) (2, 3, Y, t, X) (3, 1, X, s, X) (3, 4, X, q, Z) (4, 2, Z, t, Y ) (2, 5, Y, r, Z)
    Fig 1.(c) (1, 2, Y, s, X) (2, 3, X, s, X) (3, 1, X, t, Y ) (3, 4, X, q, Z) (4, 1, Z, t, Y ) (1, 5, Y, r, Z)
                               Table 1. DFS codes for Figure 1 (b)-(c)




                Fig. 1. Different DFS trees associated to the labeled graph (a)

   Since this encoding visits each edge only once, the worst-case complexity is
   O(m), where m is the number of edges.
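A single-walk DFS code of this kind can be sketched as follows (a toy reimplementation under our own graph representation, not the authors' code; adjacency lists are sorted only to make the walk deterministic). On the labeled triangle formed by the first three edges of Table 1, walks started at different vertices reproduce the first three tuples of the two rows of Table 1, i.e., two different codes for the same graph:

```python
def dfs_code(vlab, elab, start):
    """One (non-canonical) DFS code: 5-tuples (i, j, l_vi, l_e, l_vj)
    over DFS discovery indices, produced by a single depth-first walk
    from `start` -- a relaxation of gSpan's minimum DFS code [6]."""
    adj = {}
    for e, lab in elab.items():
        u, v = tuple(e)
        adj.setdefault(u, []).append((v, lab))
        adj.setdefault(v, []).append((u, lab))
    idx, code, done = {}, [], set()

    def visit(u):
        idx[u] = len(idx) + 1               # discovery index of u
        for v, lab in sorted(adj.get(u, [])):
            e = frozenset({u, v})
            if e in done:
                continue                    # each edge emitted once: O(m)
            done.add(e)
            if v in idx:                    # backward edge
                code.append((idx[u], idx[v], vlab[u], lab, vlab[v]))
            else:                           # forward (tree) edge
                code.append((idx[u], len(idx) + 1, vlab[u], lab, vlab[v]))
                visit(v)

    visit(start)
    return tuple(code)

# Labeled triangle: the subgraph on the first three edges of Table 1.
vl = {1: "X", 2: "Y", 3: "X"}
el = {frozenset({1, 2}): "s", frozenset({2, 3}): "t",
      frozenset({1, 3}): "s"}
print(dfs_code(vl, el, start=1))
print(dfs_code(vl, el, start=2))  # a different code for the same graph
```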
Edge sequence based incomplete encoding Given a graph G and an edge
   eij = (vi, vj) ∈ G with deg(vi) ≤ deg(vj), the edge eij is represented by the
   5-tuple (deg(vi), deg(vj), lv(vi), le(vi, vj), lv(vj)),
   where deg(v) is the degree of v in G, and lv(v) and le(e) are the labels of
   vertex v and edge e, respectively.
   Given a graph G, we denote by SEQ-DEG(G) the sequence of its edge codes:

                  SEQ-DEG(G) = code(e1) code(e2) ... code(e|E|),

   where code(ei) is the 5-tuple representing edge ei.
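Under our reading (the edge codes taken in sorted order, so the code does not depend on edge enumeration order; the tie-breaking by vertex label is also our own normalization), SEQ-DEG can be sketched as follows. The example illustrates its incompleteness: two non-isomorphic labeled 7-cycles (the label patterns AABBABB and AABBBAB are not rotations or reflections of each other) receive the same code:

```python
def seq_deg(vlab, elab):
    """SEQ-DEG(G): the sorted sequence of per-edge 5-tuples
    (deg(vi), deg(vj), lv(vi), le(e), lv(vj)) with deg(vi) <= deg(vj)."""
    deg = {v: 0 for v in vlab}
    for e in elab:
        for v in e:
            deg[v] += 1
    codes = []
    for e, lab in elab.items():
        # Orient the edge: lower degree first, ties broken by label.
        u, v = sorted(e, key=lambda x: (deg[x], vlab[x]))
        codes.append((deg[u], deg[v], vlab[u], lab, vlab[v]))
    return tuple(sorted(codes))

def cycle(labels):
    """A cycle graph whose vertex labels read `labels` around the cycle."""
    n = len(labels)
    return (dict(enumerate(labels)),
            {frozenset({i, (i + 1) % n}): "-" for i in range(n)})

g1 = cycle("AABBABB")
g2 = cycle("AABBBAB")   # non-isomorphic to g1, yet same SEQ-DEG code
print(seq_deg(*g1) == seq_deg(*g2))  # True
```

Both cycles have all degrees equal to 2 and the same multiset of labeled edges (one A-A, two B-B, four A-B), so the incomplete encoding cannot tell them apart.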

4   Experimental results

Table 4. Results of Gaston and GGM (with the DFS encoding and SEQ-DEG) on the
graph dataset PTE. MinSup is the minimum support threshold and MinFreq the
minimum frequency.

    We compared our algorithm with the Gaston tool. Table 3 (resp. Table 4)
shows the results of our algorithm on the first database (resp. the second
database). For the first results, the minimum frequency was set to 1, so the
table reports the runtimes needed to generate all the subgraphs of each
database. For the second case, we chose different values of MinFreq, expressed
also as MinSup (i.e., minimum support). For instance, for the PTE database¹,
which contains 340 graphs, the number of frequent subgraphs, i.e., graphs that
are subgraphs of at least 60% of the 340 graphs (204 graphs), is given by
#Total = #freq. paths + #freq. trees + #freq. cyclic graphs. At this frequency,
there are no frequent cyclic graphs (i.e., #freq. cyclic graphs = 0).

1   The Predictive Toxicology dataset (PTE) can be downloaded from
    http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/PTE.
    Concerning the results on both databases, the complete encoding is usually
less efficient than the incomplete one. This is explained by the fact that the
complete encoding handles a larger set of candidates than the incomplete one.
With the incomplete encoding, the number of frequent subgraphs discovered by
GGM is not far from that of Gaston. We also notice that from a frequency
threshold of 170 graphs onward, both tools return the same result.


5   Conclusion
In this paper, we have presented the GGM algorithm for the frequent subgraph
discovery problem in graph datasets. We pointed out the key points of the graph
mining process, and combined several strategies inspired by existing algorithms
to implement GGM. The two important points in the discovery process are the
generation of new candidates and the frequency counting. Our experiments show
the effectiveness of the incomplete approach compared to the complete one, and
the importance of handling a reasonable number of candidates. The main
perspective is to improve our incomplete encoding. We still have to experiment
with our incomplete approach on huge graph databases, such as those coming from
chemistry. We have implemented the whole mining algorithm in GGM and confronted
its performance with that of Gaston.


References
1. Bettina Berendt, Andreas Hotho, and Gerd Stumme. Towards semantic web mining.
   In Proceedings of the First International Semantic Web Conference (ISWC '02),
   pages 264–278, London, UK, 2002. Springer-Verlag.
2. Mukund Deshpande, Michihiro Kuramochi, Nikil Wale, and George Karypis. Fre-
   quent substructure-based approaches for classifying chemical compounds. IEEE
   Trans. on Knowl. and Data Eng., 17:1036–1050, August 2005.
3. Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. An apriori-based algorithm
   for mining frequent substructures from graph data. In Proceedings of the 4th Eu-
   ropean Conference on Principles of Data Mining and Knowledge Discovery, PKDD
   ’00, pages 13–23, London, UK, 2000. Springer-Verlag.
4. Michihiro Kuramochi and George Karypis. Frequent subgraph discovery. In Pro-
   ceedings of the 2001 IEEE International Conference on Data Mining, ICDM ’01,
   pages 313–320, Washington, DC, USA, 2001. IEEE Computer Society.
5. Siegfried Nijssen and Joost N. Kok. The gaston tool for frequent subgraph mining.
   Electron. Notes Theor. Comput. Sci., 127:77–87, March 2005.
6. Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In
   Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM '02),
   pages 721–724, 2002.
7. Ken’ichi Yoshida and Hiroshi Motoda. Clip: concept learning from inference pat-
   terns. Artif. Intell., 75:63–92, May 1995.



