=Paper= {{Paper |id=None |storemode=property |title=Symbolic Learning vs. Graph Kernels: An Experimental Comparison in a Chemical Application |pdfUrl=https://ceur-ws.org/Vol-639/031-brun.pdf |volume=Vol-639 |dblpUrl=https://dblp.org/rec/conf/adbis/BrunCFVV10 }} ==Symbolic Learning vs. Graph Kernels: An Experimental Comparison in a Chemical Application== https://ceur-ws.org/Vol-639/031-brun.pdf
      Symbolic Learning vs. Graph Kernels:
    An Experimental Comparison in a Chemical
                  Application

      Luc Brun1 , Donatello Conte3 , Pasquale Foggia3 , Mario Vento3 , and
                              Didier Villemin2
                           1
                             GREYC UMR CNRS 6072,
                               2
                               LCMT UMR CNRS 6507
                 ENSICAEN-Université de Caen Basse-Normandie,
                                14050 Caen, France
         luc.brun@greyc.ensicaen.fr, didier.villemin@ensicaen.fr
     3
       Dipartimento di Ingegneria dell’Informazione e di Ingegneria Elettrica,
    Università di Salerno, Via Ponte Don Melillo, 1 I-84084 Fisciano (SA), Italy
                       {dconte, pfoggia, mvento}@unisa.it



      Abstract. In this paper we present a quantitative comparison between
      two approaches, Graph Kernels and Symbolic Learning, within a classifi-
      cation scheme. The experimental case study is predictive toxicology
      evaluation, that is, the inference of the toxic characteristics of chemical
      compounds from their structure. The results demonstrate that the two
      approaches are comparable in terms of accuracy, but present pros and
      cons that are discussed in the last part of the paper.


1   Introduction

Many scientific areas are populated by applications dealing with structured in-
formation, i.e., complex information which can be seen as made of simpler parts
suitably interconnected. Structured information is widely used in many applica-
tion domains, such as software engineering, pattern recognition, chemistry,
medicine, linguistics, etc. Usually, structured information is represented by means
of graphs because of their high expressive power.
    Despite their attractiveness in terms of representational power, structural
methods (i.e., methods dealing with structured information) imply complex pro-
cedures both in the recognition and in the learning processes. The difficulty of
defining effective algorithms for these tasks is so high that the problem is still
considered open.
    In this paper we focus our attention on two different approaches using graphs
for classification tasks. The first one is based on the idea of facing the learning
problem directly in the representation space of the structured data (we will call
it Symbolic Learning). The second is based on the notion of Graph Kernels: a
graph kernel provides a measure of similarity between graphs that corresponds
to a metric, thus making possible the adoption of well-known statistical/neural
paradigms. Here we present several Graph Kernels based on the notions of bag
of patterns and graph edit distance, both of which preserve most of the structural
information of graphs.
    The chosen application domain is Predictive Toxicology Evaluation, that
is, the characterization of the toxic nature of chemical compounds from their
chemical structure. Namely, we used data from the Predictive Toxicology Chal-
lenge 4 , a competition based on the prediction of carcinogenic molecules. This
domain is appropriate for the analysis because of the structured nature of chem-
ical compounds.
    The paper is organized as follows: in section 2 a brief survey of the two
approaches is presented, while details on the two methods are given in
section 3; section 4 reports the results of an experimental case study of both
approaches. Finally, in the last section, conclusions are outlined.


2      Related Works

Symbolic Learning can be summarized as follows: the system builds a structural
description (prototype) for each class, and classification is carried out by means
of structural matching between the prototypes and the structured representation
of the object to classify.
    Papers dealing with this approach can be ascribed to two rather different
categories. The first category collects methods based on the assumption that
the prototypical descriptions are built by interacting, more or less deeply, with
an expert of the domain. In this regard, some authors [12] start from ideal
prototypes and manually refine, in successive approximations, a model of
the possible deformations. Others [10] assume a deformational model and use
a small training set for tuning its parameters. Despite the advantage of facing
the problem without a training set, or at least with a small one, this approach
is effective only in simple problems. The inadequacy of human knowledge for
finding a set of prototypes truly representative of a given class significantly
increases the risk of errors, especially in domains containing either many data
or many different classes.
    The methods ascribed to the second category face the problem in a formal
way ([8]) by considering it as a symbolic machine learning problem, so formu-
lated: “Given a suitably chosen set of input data, whose class is known and pos-
sibly some background domain knowledge, find out a set of optimal prototypical
descriptions for each class”.

   Graph kernel methods are based on an implicit embedding of graphs within
a vector space of large dimension. Several graph kernels have been introduced in
the literature, but graph kernels built on the notion of bag of patterns seem to be
very promising. The design of such kernels implies the following choices or heuristics:

 1. Selection of a bag of patterns from a graph,
 2. Definition of a kernel between bags based on a kernel between patterns,
 3. Definition of a kernel between patterns.

4
     http://www.predictive-toxicology.org

Kernels between graphs then correspond to kernels between bags of patterns.
The implicit assumption here is that bags of patterns capture most of the
structural information of graphs. Several kernels have been proposed based on
different types of patterns: Vert [7] proposed to compare the sets of sub-trees of
two graphs; graphlet kernels [13] are based on the number of common sub-
graphs of two graphs. However, the comparison of complex patterns requires
substantial processing time for a benefit which is not always clearly demon-
strated [7]. As a consequence, most graph kernels are based on simpler patterns
such as walks [6], trails [2] or paths. Bags of patterns may be finite or infinite.
Infinite bags, such as bags of walks, are computed implicitly using random walks
defined on both graphs being compared [6]. Bags of patterns may also be com-
puted explicitly [2] by enumerating all the patterns of a given graph. Given two
bags of patterns, the similarity between them is often measured using convolution
kernels [4]. Alternatively, both bags may be considered as two sets in the feature
space induced by the kernel between patterns; this point of view makes it possible
to measure the similarity between bags using general kernels between sets. Finally,
the similarity between patterns should be designed according both to the features
available on each element of the pattern and to the structural noise which may
corrupt patterns. For walks on labeled graphs, the well-known kernel [6], equal
to 1 if both walks are identical and 0 otherwise, encodes the fact that
labels are either equal or not comparable. The construction of a bag of patterns
from a graph removes the connections between elementary patterns within the
input graphs, which are then represented only by their bags. As previously mentioned,
the assumption of bag-of-pattern kernels is that most of the relevant information
about a graph is contained in its patterns, and thus that most of the information
is preserved by the transformation from a graph to a bag. Bunke [1] proposed
several kernels based on the graph edit distance which do not rely on such an
assumption.


3     Methods Description

3.1   Symbolic Learning

The Symbolic Learning Method presented in this section (hereafter called SLM)
is inspired by Quinlan’s FOIL algorithm, particularizing it for the Attributed Re-
lational Graph (ARG) representation. An extension of this representation has
been introduced for describing prototypes of Attributed Relational Graphs; these
graphs, called Generalized Attributed Relational Graphs (GARGs), contain gen-
eralized nodes, edges and attributes. A learning algorithm has then been formu-
lated to build such prototypes by means of a set of operations directly defined
on graphs. The algorithm preserves the generality of the prototypes generated
by classical machine learning algorithms and moreover, similarly to most
machine learning systems ([8]), the prototypes obtained by this system are con-
sistent, i.e. each prototype covers samples of a single class.
    An ARG can be defined as a 6-tuple (N, E, AN , AE , aN , aE ), where N and
E ⊂ N × N are, respectively, the sets of nodes and edges of the ARG, AN and
AE the sets of node and edge attributes and, finally, aN and aE the functions
which associate to each node or edge of the graph the corresponding attributes.
    A GARG is used for representing a prototype of a set of ARGs. In order
to allow a GARG to match a set of possibly different ARGs, in the SLM the
attribute definition has been extended. First of all, the set of types of node and
edge attributes is extended with the special type Φ, carrying no parameter and
allowed to match any attribute type, ignoring the attribute parameters. We say
that a GARG G∗ = (N∗, E∗, A∗N , A∗E , a∗N , a∗E ) covers a sample G, and we use the
notation G∗ ⊲ G (the symbol ⊲ denotes the relation called covering), if there is
a mapping µ : N∗ → N such that µ is a monomorphism and the attributes of
the nodes and of the edges of G∗ are compatible with the corresponding ones
of G. The first condition requires that each primitive and each relation in the
prototype is also present in the sample; note that the converse condition does
not hold. The second condition constrains the monomorphism to be consistent
with the attributes of the prototype and of the sample, in the sense that the type
of each attribute of the prototype must be either the special type Φ or
the type of the corresponding attribute of the sample. In the latter case, all
the parameters of the attribute, which are actually sets of values, must contain
the value of the corresponding parameter of the sample. The goal of the learning
algorithm can be stated as follows: there is a (possibly infinite) set S∗ of all the
patterns that may occur, partitioned into C different classes S∗1 , . . . , S∗C , with
S∗i ∩ S∗j = ∅ for i ≠ j; the algorithm is given a finite subset S ⊂ S∗ (training
set) of labeled patterns (S = S1 ∪ · · · ∪ SC with Si = S ∩ S∗i ), from which it tries
to find a sequence of prototype graphs G∗1 , G∗2 , . . . , G∗p , each labeled with a class
identifier, such that:

               ∀G ∈ S∗ ∃i : G∗i ⊲ G    (completeness of the prototype set)     (1)

    ∀G ∈ S∗, G∗i ⊲ G ⇒ class(G) = class(G∗i )    (consistency of the prototype set)     (2)
where class(G) and class(G∗ ) refer to the class associated with sample G and
prototype G∗ , respectively. Of course, this is an ideal goal, since only a finite
subset of S∗ is available to the algorithm; in practice, the algorithm can only
demonstrate that completeness and consistency hold for the samples in S. On
the other hand, (1) dictates that, in order to get as close as possible to the ideal
case, the generated prototypes should also be able to model samples not found
in S. However, they should not be too general, otherwise (2) will not be satisfied.
    A description of the learning algorithm is presented in the following: the
algorithm starts with an empty list L of prototypes and tries to cover the train-
ing set by successively adding consistent prototypes. When a new prototype is
found, the samples covered by it are eliminated and the process continues on the
remaining samples of the training set. A sample is then compared sequentially
against the prototypes, in the same order in which they have been generated, and
it is attributed to the class of the first prototype that covers it. In this way, each
prototype implicitly entails the condition that the sample is not covered by any
previous prototype. The algorithm fails if no consistent prototype covering the
remaining samples can be found. It is worth noting that the test of consistency
in the algorithm actually checks whether the prototype is almost consistent, i.e.
almost all the samples covered by G∗ belong to the same class:
                 Consistent(G∗ ) ⇔ max_i (|Si (G∗ )| / |S(G∗ )|) ≥ θ                (3)

where S(G∗ ) denotes the set of all the samples of the training set covered by
a prototype G∗ , Si (G∗ ) the samples of class i covered by G∗ , and θ is a
threshold close to 1. For further details about the representation and the
algorithm, refer to [3].
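The covering loop described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the helpers `find_consistent_prototype` and `covers`, and the dictionary representation of a prototype, are hypothetical stand-ins for the GARG construction and the monomorphism test detailed in [3].

```python
def learn_prototypes(training_set, find_consistent_prototype, covers):
    """Greedy covering: add consistent prototypes until the whole
    training set is covered, or fail when none can be found."""
    prototypes = []                    # ordered list L of prototypes
    remaining = list(training_set)
    while remaining:
        proto = find_consistent_prototype(remaining)
        if proto is None:              # no consistent prototype left
            raise RuntimeError("learning failed")
        prototypes.append(proto)
        # eliminate the samples covered by the new prototype
        remaining = [s for s in remaining if not covers(proto, s)]
    return prototypes


def classify(sample, prototypes, covers):
    """A sample takes the class of the first prototype covering it;
    prototypes are tried in the order in which they were generated."""
    for proto in prototypes:
        if covers(proto, sample):
            return proto["label"]
    return None
```

Because prototypes are tried in generation order, each one implicitly excludes the samples already covered by its predecessors, exactly as in the algorithm described above.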




Fig. 1. An example of GARG covering two graphs. The example is extracted from the
application domain used in the experimental phase. In bold, the parts of the two graphs
covered by G∗.



   A heuristic function H is introduced for evaluating how promising the
provisional prototype is. The heuristic function used in the algorithm is given by
the product of Hcompl and Hcons :
                        H = Hcompl (S, G∗ ) · Hcons (S, G∗ )                      (4)
and it is based on estimates of the consistency and completeness of the prototype.
The use of Hcons drives the algorithm towards consistent prototypes, while the
term Hcompl is introduced in order to favor general prototypes over
prototypes which, albeit consistent, cover only a small number of samples.
    See Figure 1 for an example of a GARG G∗ that covers the graphs G1 and
G2 representing two chemical compounds.

3.2   Graph Kernels
In this section we will introduce two kinds of Graph Kernels: the Bag of Trails
Kernel and the Laplacian Kernel.

Bag of trails Kernel Let us consider a graph G = (V, E), where V denotes the
set of vertices and E ⊂ V × V the set of edges. We define a simple graph as a
graph with no multiple edges between two nodes and no loops (a loop is an edge
linking a node with itself). We define a walk as an alternating sequence of nodes
and edges, a trail as a walk with distinct edges, and a path as a trail with distinct
nodes. The description of a graph by a bag of walks is penalized by a tottering
effect, since long walks may remain on a few edges, thereby describing only small
parts of a graph. Bags of trails drastically reduce this tottering effect and may
capture nonlinear features, such as cycles, which are not described by open paths.
We thus use bags of trails in the remainder of this paper. The cardinality of a
bag T is denoted by |T |, and the kernel between trails is denoted Ktrail .
    Suard [16] has proposed numerous kernels for bags of paths, which can easily
be extended to bags of trails. We propose to extend the mean kernel to bags
of trails. Considering bags of trails as sets and given a trail kernel Ktrail , we can
design a convolution kernel, called the mean kernel, as follows:
                  Kmean (T1 , T2 ) = (1 / (|T1 | |T2 |)) ∑_{t∈T1} ∑_{t′∈T2} Ktrail (t, t′).          (5)

where T1 and T2 denote two bags of trails.
    Following Haussler [4], this kernel is positive definite on the bag of trails
domain if and only if Ktrail is positive definite on the trail domain.
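As an illustration, the mean kernel of Equation (5) can be sketched in Python. This is our sketch, not the authors' code: `ktrail` stands for any positive definite trail kernel supplied by the caller, and trails are assumed to be hashable objects such as label tuples.

```python
def k_mean(bag1, bag2, ktrail):
    """Mean kernel of Eq. (5): average of the trail kernel over all
    pairs of trails drawn from the two bags."""
    if not bag1 or not bag2:
        return 0.0   # degenerate case: an empty bag contributes nothing
    total = sum(ktrail(t1, t2) for t1 in bag1 for t2 in bag2)
    return total / (len(bag1) * len(bag2))
```

For instance, with the 0/1 kernel that simply tests trail equality, `k_mean` measures the fraction of matching pairs between the two bags.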
    However, the mean kernel compares all the trails of both bags without con-
sidering their relevance within the bags. As a consequence, irrelevant trails may
corrupt the overall value of the kernel through the averaging effect. Recently,
Dupé et al. [2] proposed to limit this effect by using a relevance measure for
trails. Let T1 and T2 be two bags of trails; the weighted mean kernel between
these two bags is then defined as:
   Kweighted (T1 , T2 ) = (1 / (|T1 | |T2 |)) ∑_{t∈T1} ∑_{t′∈T2} ⟨RT1 (t), RT2 (t′)⟩^d Ktrail (t, t′).   (6)
where d ∈ R+ weights the importance of the relevance measure and
RT (t) denotes the relevance of trail t according to the bag T . This relevance
measure is defined as the similarity (the kernel) between the input trail and the
mean trail of the bag: RT (t) = (1/|T |) ∑_{t′∈T} Ktrail (t, t′).
    The kernel defined by Equation (6) is a convolution kernel and is therefore
positive definite if Ktrail is positive definite.
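The weighted variant of Equation (6) can be sketched in the same spirit (again our illustrative code, with `ktrail` a caller-supplied positive definite trail kernel; since the relevances are scalars here, the inner product reduces to an ordinary product):

```python
def relevance(t, bag, ktrail):
    """R_T(t): similarity of trail t to the mean trail of bag T."""
    return sum(ktrail(t, u) for u in bag) / len(bag)


def k_weighted(bag1, bag2, ktrail, d=1.0):
    """Weighted mean kernel of Eq. (6); each pairwise term is scaled by
    the product of relevances raised to the power d."""
    if not bag1 or not bag2:
        return 0.0
    total = 0.0
    for t1 in bag1:
        r1 = relevance(t1, bag1, ktrail)
        for t2 in bag2:
            r2 = relevance(t2, bag2, ktrail)
            total += (r1 * r2) ** d * ktrail(t1, t2)
    return total / (len(bag1) * len(bag2))
```

Trails that are atypical of their own bag get a low relevance and therefore contribute less to the overall kernel value, which is precisely the intended damping of irrelevant trails.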
    The above bag of trails kernels are based on a trail kernel Ktrail . A kernel
between two trails t = (v1 , e1 , . . . , vn ) and t′ = (v1′ , e′1 , . . . , vp′ ) can be derived
from the path kernel proposed by Kashima [6]. This kernel is defined as 0 if the
two trails do not have the same length, and as follows otherwise:
        Ktrail (t, t′ ) = δ(ϕ(v1 ), ϕ(v1′ )) ∏_{i=2}^{|t|} δ(ψ(e_{vi−1 vi} ), ψ(e′_{v′i−1 v′i} )) δ(ϕ(vi ), ϕ(vi′ )),   (7)

 where ϕ(v) and ψ(e) denote respectively the labels of vertex v and edge e, while
δ is equal to 1 if both of its arguments are equal and 0 otherwise. The kernel Ktrail
is positive definite as a tensor product of positive definite kernels.
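Since the product of Kronecker deltas in Equation (7) equals 1 exactly when the two label sequences coincide, the trail kernel admits a very compact sketch (our illustration, representing a trail directly by its alternating vertex/edge label sequence):

```python
def k_trail(labels1, labels2):
    """Eq. (7) with a trail given as its alternating label sequence
    (vertex label, edge label, vertex label, ...)."""
    if len(labels1) != len(labels2):
        return 0.0   # trails of different length never match
    # every delta factor must be 1, i.e. every label must agree
    return 1.0 if all(a == b for a, b in zip(labels1, labels2)) else 0.0
```

This 0/1 kernel encodes, as stated above, that labels are either equal or not comparable.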

Laplacian Kernel The graph edit distance between two graphs is defined as
the minimal number of vertex and edge removal/addition/relabeling operations
required to transform one graph into another [9]. The computational complexity
of the exact computation of the edit distance is exponential in the number of
vertices. Fortunately, Riesen and Bunke [11] proposed a heuristic to efficiently
approximate the edit distance between two graphs.
    Unfortunately, the edit distance does not define a metric and, as a conse-
quence, kernels directly based on it do not define valid kernels.
    Let us consider a set of input graphs {G1 , . . . , Gn } defining our graph test
database. Given a kernel k, the Gram matrix K associated with this database is
an n × n matrix defined by Ki,j = k(Gi , Gj ). As noted by Steinke [15], the
inverse of K (or its pseudo-inverse if K is not invertible) may be considered as
a regularization operator on the set of vectors of dimension n. Conversely, the
inverse (or pseudo-inverse) of any positive definite regularization operator may be
considered as a kernel. More precisely, within an interpolation or classification
scheme, the optimal mapping f ∗ of a real value to each graph of {G1 , . . . , Gn },
which interpolates a vector of known values y while minimizing the norm induced
by K in Rn , is obtained by solving the following minimization problem:
                   f ∗ = arg min_{f ∈ Rn} CLoss(f, y, K) + f t K −1 f
where CLoss(·, ·, ·) denotes a given loss function encoding the distance between
the vector f and the vector of known values y.
    One scheme to design a kernel thus consists in first building a positive definite
regularization operator and then taking its inverse (or its pseudo-inverse) to
obtain a kernel. Having an interpolation problem in mind, we would like our
regularization operator to penalize different values of f ∗ mapped onto graphs with
a small edit distance. We thus consider the Laplacian operator defined as follows:
given the set of graphs {G1 , . . . , Gn }, we first consider the n × n adjacency
matrix W defined by Wi,j = e^{−d(Gi ,Gj )/σ}, where σ is a tuning variable and
d(·, ·) denotes the edit distance [11]. The normalized Laplacian of {G1 , . . . , Gn }
is then defined as L = I − D^{−1/2} W D^{−1/2}, where D is the diagonal matrix
defined by Di,i = ∑_{j=1}^{n} Wi,j .
    Well-known results from spectral graph theory establish that L is a symmetric,
positive semi-definite matrix whose eigenvalues belong to [0, 2]. Unfortunately,
the Laplacian is also well known for being non-invertible: since the Laplacian
matrix is only positive semi-definite, it cannot be inverted directly. Moreover,
the pseudo-inverse of the Laplacian induces numerical instabilities which do
not lead to a reliable kernel. Therefore, following Smola [14], we rather reg-
ularize the spectrum of L. The regularized version of L, denoted L̃, is defined
as:
                                 L̃ = ∑_{i=1}^{n} r(λi ) ui u^t_i                             (8)
where {λi , ui } denotes the eigensystem of L and r(·) is a regularization function.
Many regularization functions may be used; however, we restricted our choice to
the Regularized Laplacian ([14]): r(λ) = 1 + ηλ, where η is a tuning variable. The
regularized Laplacian L̃ is invertible, and its inverse K = L̃−1 is taken as the kernel.

4      Experimental Evaluation

The application used for the comparison between the two approaches is the
Predictive Toxicology Challenge.


4.1     The Application Domain and the Test Sets

Nowadays, almost every human being in industrialized societies is exposed to
chemical compounds whose long-term effects are not evident. The carcinogenic-
ity of such compounds, for instance, is often unknown. The US National Toxicol-
ogy Program (NTP) 5 conducted several experiments on the effects of many
chemical compounds on mice and rats, but such empirical results are expensive
and very slow to obtain. Hence the need for carcinogenicity models able to
generate reliable toxicity predictions for other compounds. The Predictive
Toxicology Challenge (PTC) 6 was initiated with this aim. It enabled participants
to submit machine learning programs which, from the large training datasets of
the NTP, provide carcinogenicity predictions for the compounds in the test sets.
According to the Structure Activity Relationship (SAR) hypothesis, which assumes
that the activity of a molecule is entirely determined by its structure, the only
information used for the toxicity prediction of a molecule is related to its chemical
structure.
    Four distinct training and test sets were provided (see Table 1), since the
reactions of mice and rats to carcinogenic compounds are significantly different.
In this paper we consider all the provided datasets using, for both methods, the
following representation: each chemical compound is represented by a graph
G(V, E, α, β), where V represents the set of atoms, E is the set of edges repre-
senting bonds, α is the node labeling function that associates an atom symbol
to each node of the graph, and β is the edge labeling function that associates
a bond multiplicity to each edge of the graph. The separation into training and
test sets was made directly by the PTC committee; therefore, no cross-validation
is allowed in the training phase.
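The molecular graph representation described above can be sketched as follows. This is our minimal illustration with a hypothetical dictionary-based structure, not the representation actually used in the experiments.

```python
def make_molecule_graph(atoms, bonds):
    """Build G(V, E, alpha, beta) for a molecule.
    atoms: {node_id: atom symbol}; bonds: [(u, v, multiplicity), ...]."""
    alpha = dict(atoms)              # node labeling function (atom symbols)
    edges, beta = set(), {}          # beta: edge labeling (bond multiplicity)
    for u, v, mult in bonds:
        e = (min(u, v), max(u, v))   # bonds are undirected edges
        edges.add(e)
        beta[e] = mult
    return {"V": set(alpha), "E": edges, "alpha": alpha, "beta": beta}
```

A carbon-carbon single bond and a carbon-oxygen double bond, for example, would be encoded as edges labeled 1 and 2 respectively.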


Table 1. Number of chemical compounds in the sets used for the experimentation.
The chemical compounds are separated into 4 sets according to the reactions of: Female
Mice (FM), Male Mice (MM), Female Rats (FR) and Male Rats (MR)

         Dataset                           FM   MM   FR   MR
         Training Set  Carcinogenic         78   67   62   70
                       Non-Carcinogenic    124  123  140  124
         Test Set      Carcinogenic         35   28   35   50
                       Non-Carcinogenic     43   36   63   55



5
     National Toxicology Program: http://ntp.niehs.nih.gov/
6
     http://www.predictive-toxicology.org/

4.2   Evaluation and Comparison

In Table 2 we report the comparative results obtained on the FM, MM, FR and MR
datasets by the two different approaches (the classification with Graph Kernels is
performed by means of an SVM classifier).
    The three methods (two of which are Graph Kernel methods) achieve comparable
results. However, the time required in the training phase by the Symbolic Learning
method is much higher.
    We can conclude that in problems with strict time constraints the Graph
Kernel methods are very effective, while Symbolic Learning can be used only on
smaller problems, or where time constraints are looser. On the other hand,
Symbolic Learning generates explicit, declarative knowledge about the classes in the
form of GARGs, which can be understood with little effort by a human expert:
a direct inspection of such explicit prototypes can be sufficient to validate them,
or possibly to delete badly formed prototypes that could be detrimental
to classification performance. The Graph Kernel approach, instead, is able to
perform a classification, but not to explain the criteria that guided its choice.


Table 2. Accuracy and elapsed time (of the training phase) for the two methods on
the FM, MM, FR and MR datasets

            Method                 FM     MM     FR     MR    Elapsed Time
       Symbolic Learning         60.4%  52.1%  62.0%  62.0%   ∼ 2000 min
       Bag of trails Kernel      61.2%  50.4%  62.8%  65.6%   ∼ 2 min
       Laplacian Kernel          59.2%  55.2%  57.7%  60.9%   ∼ 1 min 30 sec



    In conclusion, the results confirm that there is no way to decide which is the
best method, because both approaches have their peculiar virtues and deficien-
cies.
    In our view, the two approaches should not be seen as competitors; instead,
their cooperative integration can provide more powerful tools for knowledge
exploitation. A first step toward the integration of the two methodologies could
be their combined use in a multiexpert system [5]. As widely demonstrated in
the literature, an accurate selection of the combining criteria makes it possible
to significantly improve on the results obtained by the single experts, by retaining
their strengths while overcoming their weaknesses. We made a preliminary eval-
uation in this direction: the results show that the two systems exhibit a certain
degree of orthogonality in their errors; that is, some of the samples misrecognized
by one of the algorithms are correctly classified by the other.


5     Conclusions

In this paper we presented a comparison between two approaches, Graph Ker-
nels and Symbolic Learning, within a classification scheme. Our experiments
indicate that both approaches are comparable in terms of accuracy. Graph ker-
nel methods require less execution time but do not provide an explanation of
the classification criteria; Symbolic Learning, on the other hand, has the reverse
advantages and drawbacks. The two methods are thus clearly complementary.
    A future step, as discussed in the last section, could be an integration of
the two approaches in order to improve on the results obtained by each approach
alone.

Acknowledgments. We would like to express our thanks to Alice Kijewski and
David Lemaresquier for their support in the experimental phase.

References
 1. Bunke, H., Riesen, K.: A family of novel graph kernels for structural pattern recog-
    nition. In: XII Iberoamerican Congress on Pattern Recognition. pp. 20–31 (2007)
 2. Dupé, F.X., Brun, L.: Tree covering within a graph kernel framework for shape
    classification. In: XV ICIAP (2009)
 3. Foggia, P., Genna, R., Vento, M.: Symbolic vs connectionist learning: an experi-
    mental comparison in a structured domain. IEEE Transactions on Knowledge and
    Data Engineering 13(2), 176–195 (2001)
 4. Haussler, D.: Convolution kernels on discrete structures. Tech. rep., Department
    of Computer Science, University of California at Santa Cruz (1999)
 5. Iannello, G., Percannella, G., Sansone, C., Soda, P.: On the use of classification
    reliability for improving performance of the one-per-class decomposition method.
    Data & Knowledge Engineering 68, 1398–1410 (2009)
 6. Kashima, H., Tsuda, K., Inokuchi, A.: Marginalized kernel between labeled graphs.
    In: Proc. of the Twentieth International Conference on Machine Learning (2003)
 7. Mahé, P., Vert, J.P.: Graph kernels based on tree patterns for molecules. Machine
    Learning 75(1), 3–35 (2008)
 8. Michalski, R.: Pattern recognition as rule-guided inductive inference. IEEE Trans-
    actions on PAMI 2(4), 349–361 (1980)
 9. Neuhaus, M., Bunke, H.: Bridging the Gap Between Graph Edit Distance and Ker-
    nel Machines. World Scientific Publishing Co., Inc., River Edge, NJ, USA (2007)
10. Nishida, H.: Shape recognition by integrating structural descriptions and geomet-
    rical/statistical transforms. CVIU 64, 248–262 (1996)
11. Riesen, K., Bunke, H.: Approximate graph edit distance computation by means of
    bipartite graph matching. Image Vision Computing 27(7), 950–959 (2009)
12. Rocha, J., Pavlidis, T.: A shape analysis model with applications to a character
    recognition system. IEEE Transactions on PAMI 16(4), 393–404 (1994)
13. Shervashidze, N., Vishwanathan, S.V., Petri, T.H., Mehlhorn, K., Borgwardt,
    K.M.: Efficient graphlet kernels for large graph comparison. In: Twelfth Interna-
    tional Conference on Artificial Intelligence and Statistics (2009)
14. Smola, A.J., Kondor, R.I.: Kernels and regularization on graphs. In: XV Annual
    Conference on Learning Theory. pp. 144–158 (2003)
15. Steinke, F., Schölkopf, B.: Kernels, regularization and differential equations. Pat-
    tern Recognition 41(11), 3271–3286 (2008)
16. Suard, F., Rakotomamonjy, A., Bensrhair, A.: Kernel on bag of paths for measur-
    ing similarity of shapes. In: European Symposium on Artificial Neural Networks.
    Bruges-Belgique (April 2007)