-

Complete and incomplete approaches for graph mining?

Amina Kemmar

Yahia Lebbah

ylebbah@yahoo.fr 1

Mohammed Ouali

mohammed.ouali@gmail.com 1

Samir Loudni

samir.loudni@unicaen 0 1 0 University of Caen - Campus II, Department of Computer Science , France fr 1 University of Oran, Es-Senia, Lab. LITIO , B.P. 1524 EL-M'Naouar, Oran , Algeria

312 317

In this paper, we revisit approaches for graph mining where a set of simple encodings is proposed. Complete approaches are those using an encoding allowing to get all the frequent subgraphs. Whereas incomplete approaches do not guarantee to nd all the frequent subgraphs. Our objective is also to highlight the critical points in the process of extracting the frequent subgraphs with complete and incomplete approaches. Current canonical encodings have a complexity which is of exponential nature, motivating this paper to propose a relaxation of canonicity of the encoding leading to complete and incomplete encodings with a linear complexity. These techniques are implemented within our graph miner GGM (Generic Graph Miner) and then evaluated on a set of graph databases, showing the behavior of both complete and incomplete approaches.

Graph mining frequent subgraph pattern discovery graph isomorphism

Graph-mining represents the set of techniques used to extract interesting and previously unknown information about the relational aspect from data sets that are represented with graphs.

We revisit some approaches for graph mining where a set of simple encodings is proposed. Complete approaches are those using an encoding enabling to get all the frequent subgraphs. Whereas incomplete approaches do not guarantee to nd all the frequent subgraphs. Our objective is also to highlight the critical points in the process of extracting the frequent subgraphs. The introduced techniques are implemented within GGM (generic graph miner). We provide an experimentation with GGM showing the behavior of complete and incomplete approaches. It is not proven if the canonical encoding of graphs is in the class of NP-complete problems, nor in polynomial class. This is also veri ed in practice, ? This work is supported by TASSILI research program 11MDU839 (France, Algeria). since that all the current canonical encodings have complexities which are of exponential nature. This motivates deeply our work on proposing a relaxation of the canonicity of the encoding, leading us to what we quali ed in this paper complete and incomplete encodings with low polynomial complexities.

The following section 2 introduces preliminaries on graph mining and the current approaches to solve frequent subgraph discovery problem. Section 3 explains our graph mining algorithm GGM. Experimental results of GGM are given in section 4. Section 5 concludes the paper and addresses some perspectives. 2

Frequent subgraph discovery problem

An undirected graph G = (V; E) is made of the set of vertices V and the set of edges E V V . Each edge (v1; v2) is an unordered pair of vertices. We will assume that the graph is labeled with vertex labels LV and edge labels LE; the same label can be assigned to many vertices (or edges) in the same graph. The size of a graph G = (V; E) is de ned to be equal to jEj.

De nition 1 (Frequent Subgraph discovery). Given a database G which contains a collection of graphs. The frequency of a graph G in G is de ned by f req(G; G) = #fG0 2 GjG G0g. The support of a graph is de ned by support(G; G) = f req(G; G)=jGj:

The frequent subgraph discovery problem consists to nd all connected undirected graphs F that are subgraphs of at least minsupjGj graphs of G: F = fG 2 Gjsupport(G; G) minsupg; for some prede ned minimum support threshold minsup that is speci ed by the user.

Generally, we can distinguish between the methods of discovering frequent subgraphs according to the way the three following problems are handled: Candidates generation problem This is the rst step in the frequent subgraph discovery process which depends on the search strategy. It can be done with breadth rst or depth rst strategies. With breadth rst strategy, all k-candidates (i.e., having k edges) are generated together, then (k + 1)candidates and so on; making the memory consumption huge [ 4 ][ 3 ]. But with a depth approach, the k-candidates are iteratively generated, one by one. Subgraph encoding problem When some new candidate is produced, we should verify that it has been already generated. This can be resolved by testing if this new candidate is isomorphic to one of the already generated subgraphs. The canonical DFS code [ 6 ] is usually used to encode the generated frequent subgraphs. By this way, verifying that the new candidate is isomorphic to one of the already generated candidates is equivalent to testing if its encoding is equal to the encoding of some already generated candidate. Frequency computation problem If some new candidate is declared to be not isomorphic to any of the already produced candidates, we should compute its frequency. It could be done by nding all the graphs of the database which contain this new candidate.

In the following section, we present a new algorithm GGM - Generic Graph Miner - for nding connected frequent subgraphs in a graphs database. We propose also some simple encodings to handle e ciently the frequency counting problem. 3

GGM, a generic graph miner

GGM nds frequent subgraphs, parameterized with some encoding strategies detailed in section 3.1. It is generic, because we aim to make the key steps of GGM easily parameterized.

Algorithm 1 GGM(G, fmin) Require: G represents the graph dataset and fmin the minimum frequency threshold. Ensure: F is the set of frequent subgraphs in G. 1: F 2: E a;ll frequent edge labels in G 3: N all frequent node labels in G 4: P Generate-Paths(N , E, G, fmin) 5: T Generate-Trees(P, E, G, fmin) 6: C Generate-Cyclic-Graphs(T , E, G, fmin) 7: F 8: RETUPR[NTF[ C

The general structure of the algorithm is illustrated in algorithm 1. The algorithm initializes the frequent subgraphs with all frequent edges and nodes within the graph database G. Then, the algorithm proceeds with three separated steps: 1. enumerating frequent paths from the frequent nodes, 2. generating the frequent trees from the frequent paths by keeping the same extremities of each initial path, 3. extending the frequent paths and trees by adding an edge between two existing nodes to obtain cyclic graphs.

This approach is inspired from GASTON [ 5 ] in which these three steps are repeated for each discovered subgraph. In other words, GASTON loops on the above three steps, whereas in our approach, they are executed one time only. 3.1

Graph encoding The canonical labeling is used to check whether a particular candidate subgraph has already been generated or not. However, developing algorithms that can e ciently compute the canonical labeling is critical to ensure that the mining algorithm can scale to very large graph datasets. There exists di erent ways to assign a code to a given graph, but it must uniquely identify the graph such that if two graphs are isomorphic, they will be assigned the same code. Such encoding is called a canonical encoding. It is not proven if the canonical encoding of graphs is in the class of NP-complete problems, nor in polynomial class. This is also veri ed in practice, since that all the current canonical encodings have complexities which are of exponential nature.

The idea of our encoding is to use a non-canonical encoding, resulting in two kinds of encodings : complete and incomplete.

De nition 2 (Complete and incomplete encodings). Let f be an encoding function. For any two distinct non-isomorphic graphs G1 and G2, f is complete if f (G1) 6= f (G2). Otherwise, f is said to be incomplete. where ei <l ei+1, the relation <l de nes a lexicographic order between edges (e.g. (2; 3; X; s; Y ) < (2; 3; Z; s; Y )). The code associated to the graph (a) of Figure 1 is: (1,4,Z,r,Y)(2,3,X,s,X)(2,3,Z,q,X)(2,4,X,s,Y)(2,4,Z,t,Y)(3,4,X,t,Y). Enumerating the edges is done with O(m), where m is the number of edges, but sorting lexicographically the edges requires O(m log(m)) which is the worst case complexity of sorting algorithms. Thus, in nal, the complexity of this encoding is O(m log(m)). This encoding is not complete because we can nd examples of non-isomorphic graphs having the same code. 4

Experimental Results

We have done a comparison between our algorithm and the Gaston tool. Table 3 (resp. Table 4) shows the results of our algorithm with the rst database (resp. second database). The minimum frequency was set to 1 for the rst results. So, this table presents the runtimes to generate all the subgraphs of each database. While for the second case, we choose di erent MinFreq expressed also by MinSup (i.e. minimum support). For instance, for the PTE database which contains 340 graphs, the number of graphs that are subgraphs of at least 1 The Predictive Toxicology dataset (PTE) can http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/PTE. be downloaded from 60% = 230440 graphs of PTE, is given by #T otal = #f req:paths + #f req:trees + #f req:cyclic graphs. From this frequency, we see that there is no frequent cyclic graphs (i.e. #freq. cyclic graphs = 0).

Concerning the results on both databases, the complete encoding is usually less performant than the incomplete one. This is explained by the fact that the complete encoding handles larger set of candidates than the incomplete one. For the incomplete encoding, the number of frequent subgraphs discovered by GGM is not too far from that of Gaston. We notice also that from the frequency of 170 graphs, the result is the same.

Conclusion

In this paper, we have presented algorithm GGM for the frequent subgraph discovery problem in a graph datasets. We pointed out the key points in the graph mining process. We combined several strategies inspired from existing algorithms to implement GGM. The two important points in the process of discovery are the generation of new candidates and the frequency counting. Our experimentations show the e ectiveness of the incomplete approach compared to the complete one. It shows the importance of handling a reasonable amount of candidates. The main perspective is to improve our incomplete encoding. We have to experiment our incomplete approach on huge graph databeses such as those coming from chemistery. Actually, we have implemented the whole mining algorithm in GGM and confront its performance to Gaston.

Bettina

Berendt , Andreas Hotho, and

Stum

Gerd . Towards semantic web mining . In Proceedings of the First International Semantic Web Conference on The Semantic Web, ISWC '02 , pages 264 { 278 , London, UK, UK, 2002 . Springer-Verlag.

Mukund

Deshpande , Michihiro Kuramochi, Nikil Wale, and

George

Karypis . Frequent substructure-based approaches for classifying chemical compounds . IEEE Trans. on Knowl. and Data Eng ., 17 : 1036 { 1050 , August 2005 .

Akihiro

Inokuchi , Takashi Washio, and

Hiroshi

Motoda . An apriori-based algorithm for mining frequent substructures from graph data . In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery , PKDD '00 , pages 13 { 23 , London, UK, 2000 . Springer-Verlag.

Michihiro

Kuramochi and

George

Karypis . Frequent subgraph discovery . In Proceedings of the 2001 IEEE International Conference on Data Mining, ICDM '01 , pages 313 { 320 , Washington, DC, USA, 2001 . IEEE Computer Society.

Siegfried

Nijssen and

Joost N.

Kok . The gaston tool for frequent subgraph mining . Electron. Notes Theor. Comput. Sci. , 127 : 77 { 87 , March 2005 .

Xifeng

Yan and Jiawei Han. gspan: Graph-based substructure pattern mining . Order A Journal On The Theory Of Ordered Sets And Its Applications , 02 : 721 { 724 , 2002 .

7. Ken 'ichi Yoshida and Hiroshi Motoda. Clip: concept learning from inference patterns . Artif . Intell., 75 : 63 { 92 , May 1995 .