=Paper=
{{Paper
|id=Vol-1670/paper-19
|storemode=property
|title=None
|pdfUrl=https://ceur-ws.org/Vol-1670/paper-19.pdf
|volume=Vol-1670
}}
==None==
Min-Hashing for Probabilistic Frequent Subtree Feature Spaces? Pascal Welke1 , Tamás Horváth1,2 , and Stefan Wrobel1,2 1 Dept. of Computer Science, University of Bonn, Germany 2 Fraunhofer IAIS, Schloss Birlinghoven, Sankt Augustin, Germany Abstract. We propose a fast algorithm for approximating graph similarities. Here, the similarity between two graphs is defined by the Jaccard-similarity of their images in a binary feature space spanned by the set of frequent subtrees generated for some training dataset. While being an adequate choice for many similarity based learning tasks, this approach su↵ers from severe computational limitations. In particular, mining frequent trees in arbitrary graph databases cannot be done in output polynomial time and embedding a graph in the above space is NP-hard. To overcome these limitations, we represent each graph by k of its spanning trees generated uniformly at random. In this way, we reduce the frequent sub- graph mining, as well as the embedding of a graph into the feature space to problems involving only trees and forests. Clearly, the output of this probabilis- tic technique is always sound (any tree found to be frequent by this algorithm is a frequent subtree with respect to the original dataset), but incomplete (the al- gorithm may miss frequent subtrees). Similarly, the embedding of a given graph in the feature space spanned by the above trees is computed with a one-sided error. We improve the speed and space consumption of the above method by applying min-hashing for the embedding step. Each graph is represented by a small sketch vector that can be used to approximate Jaccard-distances. We show that the partial order on the feature set defined by subgraph isomorphism allows for a fast calculation of the min-hash sketch, without explicitly performing the feature space embedding. Our experimental results demonstrate that the proposed technique can dra- matically reduce the number of subtree isomorphism tests, compared to an al- gorithm performing the embedding explicitly. We also show that even for a few random spanning trees per chemical compound, remarkable precisions of the active molecules can be obtained in a highly imbalanced chemical dataset by taking the i nearest neighbors of an active compound. Finally, we show that the predictive power of support vector machines using our approximate similarities compares favorably to that of state-of-the-art related methods. A long version of this extended abstract appears in [1]. [1] P. Welke, T. Horváth, and S. Wrobel. Min-Hashing for Probabilistic Frequent Sub- tree Feature Spaces. To appear in: Proceedings of the 19th International Conference on Discovery Science, DS 2016, Springer LNAI, 2016. ? Copyright c , 2016 by the papers authors. Copying permitted only for private and academic purposes.