=Paper=
{{Paper
|id=None
|storemode=property
|title=Complete and Incomplete Approaches for Graph Mining
|pdfUrl=https://ceur-ws.org/Vol-867/Paper35.pdf
|volume=Vol-867
|dblpUrl=https://dblp.org/rec/conf/icwit/KemmarLOL12
}}
==Complete and Incomplete Approaches for Graph Mining==
Complete and incomplete approaches for graph mining? Amina Kemmar 1 , Yahia Lebbah 1 , Mohammed Ouali 1 , and Samir Loudni 2 University of Oran, Es-Senia, Lab. LITIO, B.P. 1524 EL-M’Naouar, Oran, Algeria 2 University of Caen - Campus II, Department of Computer Science, France {kemmami,ylebbah}@yahoo.fr,mohammed.ouali@gmail.com,samir.loudni@unicaen. fr Abstract. In this paper, we revisit approaches for graph mining where a set of simple encodings is proposed. Complete approaches are those using an encoding allowing to get all the frequent subgraphs. Whereas incomplete approaches do not guarantee to find all the frequent sub- graphs. Our objective is also to highlight the critical points in the process of extracting the frequent subgraphs with complete and incomplete ap- proaches. Current canonical encodings have a complexity which is of exponential nature, motivating this paper to propose a relaxation of canonicity of the encoding leading to complete and incomplete encod- ings with a linear complexity. These techniques are implemented within our graph miner GGM (Generic Graph Miner) and then evaluated on a set of graph databases, showing the behavior of both complete and incomplete approaches. Keywords: Graph mining, frequent subgraph, pattern discovery, graph isomorphism. 1 Introduction Graph-mining represents the set of techniques used to extract interesting and previously unknown information about the relational aspect from data sets that are represented with graphs. We revisit some approaches for graph mining where a set of simple encodings is proposed. Complete approaches are those using an encoding enabling to get all the frequent subgraphs. Whereas incomplete approaches do not guarantee to find all the frequent subgraphs. Our objective is also to highlight the criti- cal points in the process of extracting the frequent subgraphs. The introduced techniques are implemented within GGM (generic graph miner). We provide an experimentation with GGM showing the behavior of complete and incomplete approaches. It is not proven if the canonical encoding of graphs is in the class of NP-complete problems, nor in polynomial class. This is also verified in practice, ? This work is supported by TASSILI research program 11MDU839 (France, Algeria). 312 2 since that all the current canonical encodings have complexities which are of exponential nature. This motivates deeply our work on proposing a relaxation of the canonicity of the encoding, leading us to what we qualified in this paper complete and incomplete encodings with low polynomial complexities. The following section 2 introduces preliminaries on graph mining and the cur- rent approaches to solve frequent subgraph discovery problem. Section 3 explains our graph mining algorithm GGM. Experimental results of GGM are given in section 4. Section 5 concludes the paper and addresses some perspectives. 2 Frequent subgraph discovery problem An undirected graph G = (V, E) is made of the set of vertices V and the set of edges E ⊆ V × V . Each edge (v1 , v2 ) is an unordered pair of vertices. We will assume that the graph is labeled with vertex labels LV and edge labels LE ; the same label can be assigned to many vertices (or edges) in the same graph. The size of a graph G = (V, E) is defined to be equal to |E|. Definition 1 (Frequent Subgraph discovery). Given a database G which contains a collection of graphs. The frequency of a graph G in G is defined by f req(G, G) = #{G0 ∈ G|G ⊆ G0 }. The support of a graph is defined by support(G, G) = f req(G, G)/|G|. The frequent subgraph discovery problem consists to find all connected undi- rected graphs F that are subgraphs of at least minsup|G| graphs of G: F = {G ∈ G|support(G, G) ≥ minsup}, for some predefined minimum support threshold minsup that is specified by the user. Generally, we can distinguish between the methods of discovering frequent subgraphs according to the way the three following problems are handled: Candidates generation problem This is the first step in the frequent sub- graph discovery process which depends on the search strategy. It can be done with breadth first or depth first strategies. With breadth first strategy, all k-candidates (i.e., having k edges) are generated together, then (k + 1)- candidates and so on; making the memory consumption huge [4][3]. But with a depth approach, the k-candidates are iteratively generated, one by one. Subgraph encoding problem When some new candidate is produced, we should verify that it has been already generated. This can be resolved by testing if this new candidate is isomorphic to one of the already generated subgraphs. The canonical DFS code [6] is usually used to encode the gen- erated frequent subgraphs. By this way, verifying that the new candidate is isomorphic to one of the already generated candidates is equivalent to testing if its encoding is equal to the encoding of some already generated candidate. 313 Complete and incomplete approaches for graph mining?? 3 Frequency computation problem If some new candidate is declared to be not isomorphic to any of the already produced candidates, we should com- pute its frequency. It could be done by finding all the graphs of the database which contain this new candidate. In the following section, we present a new algorithm GGM - Generic Graph Miner - for finding connected frequent subgraphs in a graphs database. We pro- pose also some simple encodings to handle efficiently the frequency counting problem. 3 GGM, a generic graph miner GGM finds frequent subgraphs, parameterized with some encoding strategies detailed in section 3.1. It is generic, because we aim to make the key steps of GGM easily parameterized. Algorithm 1 GGM(G, fmin ) Require: G represents the graph dataset and fmin the minimum frequency threshold. Ensure: F is the set of frequent subgraphs in G. 1: F ← ∅ 2: E ← all frequent edge labels in G 3: N ← all frequent node labels in G 4: P ← Generate-Paths(N , E, G, fmin ) 5: T ← Generate-Trees(P, E, G, fmin ) 6: C ← Generate-Cyclic-Graphs(T , E, G, fmin ) 7: F ← P ∪ T ∪ C 8: RETURN F The general structure of the algorithm is illustrated in algorithm 1. The algorithm initializes the frequent subgraphs with all frequent edges and nodes within the graph database G. Then, the algorithm proceeds with three separated steps: 1. enumerating frequent paths from the frequent nodes, 2. generating the frequent trees from the frequent paths by keeping the same extremities of each initial path, 3. extending the frequent paths and trees by adding an edge between two ex- isting nodes to obtain cyclic graphs. This approach is inspired from GASTON [5] in which these three steps are repeated for each discovered subgraph. In other words, GASTON loops on the above three steps, whereas in our approach, they are executed one time only. 3.1 Graph encoding The canonical labeling is used to check whether a particular candidate subgraph has already been generated or not. However, developing algorithms that can efficiently compute the canonical labeling is critical to ensure that the mining algorithm can scale to very large graph datasets. There exists different ways to assign a code to a given graph, but it must uniquely identify the graph such 314 4 that if two graphs are isomorphic, they will be assigned the same code. Such encoding is called a canonical encoding. It is not proven if the canonical encoding of graphs is in the class of NP-complete problems, nor in polynomial class. This is also verified in practice, since that all the current canonical encodings have complexities which are of exponential nature. The idea of our encoding is to use a non-canonical encoding, resulting in two kinds of encodings : complete and incomplete. Definition 2 (Complete and incomplete encodings). Let f be an encoding function. For any two distinct non-isomorphic graphs G1 and G2 , f is complete if f (G1 ) 6= f (G2 ). Otherwise, f is said to be incomplete. DFS based complete encoding This encoding is a relaxation of that defined in [6]. Such encoding is processed by taking only one walk through a depth first search (and not the minimum as in [6]). It is straightforward that this encoding is complete, and the same graph can be generated several times as illustrated in Figure 1 which shows that two isomorphic graphs can have different codes. For the graph (a) in Figure 1, there exists several DFS codes. Two of them, which are based on the DFS trees in Figure 1(b)-(c) are listed in Table 1. edge 0 1 2 3 4 5 Fig 1.(b) (1, 2, X, s, Y ) (2, 3, Y, t, X) (3, 1, X, s, X) (3, 4, X, q, Z) (4, 2, Z, t, Y ) (2, 5, Y, r, Z) Fig 1.(c) (1, 2, Y, s, X) (2, 3, X, s, X) (3, 1, X, t, Y ) (3, 4, X, q, Z) (4, 1, Z, t, Y ) (1, 5, Y, r, Z) Table 1. DFS codes for Figure 1 (b)-(c) Fig. 1. Different DFS trees associated to the labeled graph (a) Since this encoding visits only once the edges, it is then straightforward that the worst-case complexity is O(m), where m is the number of edges. Edge sequence based incomplete encoding Given a graph G and an edge eij = (vi , vj ) ∈ G, where deg(vi ) ≤ deg(vj ), the edge eij is represented by the 5-tuple: (deg(vi ), deg(vj ), lv (vi ), le (vi , vj ), lj (vj )), where deg(v) is the degree of v in the graph G, lv (v) and le (e) are the labels of the vertex v and the edge e respectively. Given a graph G, we denote SEQ-DEG(G) the sequence of its edge codes : SEQ − DEG(G) = code(e1 )code(e2 )...code(e|E| ) where ei2000 0,01 0,60 4,29 0,01 0,26 0,68 Table 4. Results of Gaston and GGM (with the DFS encoding and SEQ-DEG) on the graph dataset PTE. MinSup represents the minimum support threshold and MinFreq the minimum frequency. We have done a comparison between our algorithm and the Gaston tool. Ta- ble 3 (resp. Table 4) shows the results of our algorithm with the first database (resp. second database). The minimum frequency was set to 1 for the first re- sults. So, this table presents the runtimes to generate all the subgraphs of each database. While for the second case, we choose different MinFreq expressed also by MinSup (i.e. minimum support). For instance, for the PTE database which contains 340 graphs, the number of graphs that are subgraphs of at least 1 The Predictive Toxicology dataset (PTE) can be downloaded from http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/PTE. 316 6 60% = 204 340 graphs of PTE, is given by #T otal = #f req.paths + #f req.trees + #f req.cyclic graphs. From this frequency, we see that there is no frequent cyclic graphs (i.e. #freq. cyclic graphs = 0). Concerning the results on both databases, the complete encoding is usually less performant than the incomplete one. This is explained by the fact that the complete encoding handles larger set of candidates than the incomplete one. For the incomplete encoding, the number of frequent subgraphs discovered by GGM is not too far from that of Gaston. We notice also that from the frequency of 170 graphs, the result is the same. 5 Conclusion In this paper, we have presented algorithm GGM for the frequent subgraph discovery problem in a graph datasets. We pointed out the key points in the graph mining process. We combined several strategies inspired from existing algorithms to implement GGM. The two important points in the process of discovery are the generation of new candidates and the frequency counting. Our experimentations show the effectiveness of the incomplete approach compared to the complete one. It shows the importance of handling a reasonable amount of candidates. The main perspective is to improve our incomplete encoding. We have to experiment our incomplete approach on huge graph databeses such as those coming from chemistery. Actually, we have implemented the whole mining algorithm in GGM and confront its performance to Gaston. References 1. Bettina Berendt, Andreas Hotho, and Stum Gerd. Towards semantic web mining. In Proceedings of the First International Semantic Web Conference on The Semantic Web, ISWC ’02, pages 264–278, London, UK, UK, 2002. Springer-Verlag. 2. Mukund Deshpande, Michihiro Kuramochi, Nikil Wale, and George Karypis. Fre- quent substructure-based approaches for classifying chemical compounds. IEEE Trans. on Knowl. and Data Eng., 17:1036–1050, August 2005. 3. Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Proceedings of the 4th Eu- ropean Conference on Principles of Data Mining and Knowledge Discovery, PKDD ’00, pages 13–23, London, UK, 2000. Springer-Verlag. 4. Michihiro Kuramochi and George Karypis. Frequent subgraph discovery. In Pro- ceedings of the 2001 IEEE International Conference on Data Mining, ICDM ’01, pages 313–320, Washington, DC, USA, 2001. IEEE Computer Society. 5. Siegfried Nijssen and Joost N. Kok. The gaston tool for frequent subgraph mining. Electron. Notes Theor. Comput. Sci., 127:77–87, March 2005. 6. Xifeng Yan and Jiawei Han. gspan: Graph-based substructure pattern mining. Order A Journal On The Theory Of Ordered Sets And Its Applications, 02:721–724, 2002. 7. Ken’ichi Yoshida and Hiroshi Motoda. Clip: concept learning from inference pat- terns. Artif. Intell., 75:63–92, May 1995. 317