Efficient kernels for sentence pair classification

Fabio Massimo Zanzotto and Lorenzo Dell'Arciprete
University of Rome "Tor Vergata"
Via del Politecnico 1, 00133 Roma, Italy
zanzotto@info.uniroma2.it, lorenzo.dellarciprete@gmail.com

Abstract

In this paper, we propose a class of graphs, the tripartite directed acyclic graphs (tDAGs), to model first-order rule feature spaces for sentence pair classification. We introduce an algorithm for computing the similarity in first-order rewrite rule feature spaces. Our algorithm is extremely efficient and, as it computes the similarity of instances that can be represented in explicit feature spaces, it is a valid kernel function.

1 Introduction

Natural language processing models are generally combinations of linguistic models and automatically learnt classifiers. As trees are extremely important in many linguistic theories, a large amount of work exploiting machine learning algorithms for NLP tasks has been developed for this class of data structures [3, 14]. These works propose efficient algorithms for determining the similarity between two trees in tree fragment feature spaces.

Yet, some NLP tasks such as textual entailment recognition [5, 6] and some linguistic theories such as HPSG [16] require more general graphs and, then, more general algorithms for computing the similarity among graphs. Unfortunately, algorithms for computing the similarity between two general graphs in terms of common subgraphs are still exponential [18]. In these cases, approximated algorithms have been proposed. For example, the one proposed in [9] counts the number of subpaths in common. The same happens for the one proposed in [19], which is applicable to a particular class of graphs, i.e., the hierarchical directed acyclic graphs. These algorithms do not compute the number of subgraphs in common between two graphs; they only approximate the feature spaces we need in these NLP tasks. To compute similarities in these feature spaces, we have to investigate whether we can define a particular class of graphs for the class of tasks we want to solve. Once we have identified this class of graphs, we can explore efficient similarity algorithms.

A very important class of graphs can be defined for tasks involving sentence pairs. In these cases, an important class of feature spaces is the one that represents first-order rewrite rules. For example, in textual entailment recognition [6], we need to determine whether a text T implies a hypothesis H, e.g., whether or not "Farmers feed cows animal extracts" entails "Cows eat animal extracts" (T1, H1). If we want to learn textual entailment classifiers, we need to exploit first-order rules hidden in training instances. To positively exploit the training instance "Pediatricians suggest women to feed newborns breast milk" entails "Pediatricians suggest that newborns eat breast milk" (T2, H2) for classifying the above example, learning algorithms should learn that the two instances hide the first-order rule ρ = feed Y Z → Y eat Z. The first-order rule feature space, introduced by [22], gives high performance in terms of accuracy for textual entailment recognition with respect to other feature spaces.
In this paper, we propose a class of graphs, the tripartite directed acyclic graphs (tDAGs), that model first-order rule feature spaces and, using this class of graphs, we introduce an algorithm for computing the similarity in first-order rewrite rule feature spaces. The possibility of explicitly representing the first-order feature space as subgraphs of tDAGs makes the derived similarity function a valid kernel. With respect to the algorithm proposed in [15], our algorithm is more efficient and it is a valid kernel function.

The paper is organized as follows. In Section 2, we first describe tripartite directed acyclic graphs (tDAGs) to model first-order rule (FOR) feature spaces. In Section 3, we then present the related work. In Section 4, we introduce the similarity function for these FOR spaces. This can be used as a kernel function in kernel-based machines (e.g., support vector machines [4]). We then introduce our efficient algorithm for computing the similarity among tDAGs. In Section 5, we analyze the computational efficiency of our algorithm, showing that it is far more efficient than the algorithm proposed in [15]. Finally, in Section 6, we draw conclusions and outline future work.

2 Representing first-order rules and sentence pairs as tripartite directed acyclic graphs

As a first step, we want to define the tripartite directed acyclic graphs (tDAGs). This is an extremely important class of graphs for the first-order rule feature spaces we want to model. We here want to show intuitively that, if we model first-order rules and sentence pairs as tDAGs, determining whether or not a sentence pair can be unified with a first-order rewrite rule is a graph matching problem. This intuitive idea helps in determining our efficient algorithm for exploiting first-order rules in learning examples.

To illustrate the above idea we will use an example based on the above rule ρ = feed Y Z → Y eat Z and the above sentence pair (T1, H1). The rule ρ encodes the entailment relation between the verb to feed and the verb to eat. Represented over a syntactic interpretation, the rule has the following aspect:

ρ1 = [S [VP [VB feed] [NP Y] [NP Z]]] → [S [NP Y] [VP [VB eat] [NP Z]]]

[Figure 1: A simple rule and a simple pair as a graph. Panel (a) shows the rule ρ1 and panel (b) shows the pair (T1, H1), each drawn as two trees connected through unlabelled nodes.]

As in the case of feature structures [2], we can view this rule as a graph. As we are not interested in the variable names but only in the relation between the left-hand side and the right-hand side of the rule, we can substitute each variable with an unlabelled node. We then connect the tree nodes carrying variables with the corresponding unlabelled node. The result is a graph like the one in Figure 1(a). The variables Y and Z are represented by the unlabelled nodes between the trees.

In the same way we can represent the sentence pair (T1, H1) using a graph with explicit links between related words and nodes (see Figure 1(b)). We can link words using anchoring methods as in [17]. These links can then be propagated in the syntactic tree using the semantic heads of the constituents [16]. The rule ρ1 matches over the pair (T1, H1) if the graph ρ1 is among the subgraphs of the graph in Figure 1(b). Both rules and sentence pairs are graphs of the same type.
These graphs are basically two trees connected through an intermediate set of nodes representing variables in the rules and relations between nodes in the sentence pairs. We will hereafter call these graphs tripartite directed acyclic graphs (tDAGs). The formal definition follows.

Definition (tDAG). A tripartite directed acyclic graph is a graph G = (N, E) where

• the set of nodes N is partitioned into three sets Nt, Ng, and A;

• the set of edges is partitioned into four sets Et, Eg, EAt, and EAg such that t = (Nt, Et) and g = (Ng, Eg) are two trees and EAt = {(x, y) | x ∈ Nt and y ∈ A} and EAg = {(x, y) | x ∈ Ng and y ∈ A} are the edges connecting the two trees to the intermediate nodes.

[Figure 2: Two tripartite DAGs — the pairs P1 and P2 (the tDAGs of the pairs (T1, H1) and (T2, H2)), drawn as pairs of extended trees whose shared placeholders are shown as boxed indices.]

A tDAG is a partially labeled graph. The labeling function L only applies to the subsets of nodes related to the two trees, i.e., L : Nt ∪ Ng → L. Nodes in the set A are not labeled.

The explicit representation of the tDAG in Figure 1(b) has been useful to show that the unification of a rule and a sentence pair is a graph matching problem. Yet, it is hard to read. We will therefore describe a tDAG with an alternative, more convenient representation. A tDAG G = (N, E) can be seen as a pair G = (τ, γ) of extended trees τ and γ where τ = (Nt ∪ A, Et ∪ EAt) and γ = (Ng ∪ A, Eg ∪ EAg). These are extended trees as each tree contains the relations with the other tree. As for feature structures, we will graphically represent an (x, y) ∈ EAt and a (z, y) ∈ EAg as a box [y] on the node x and on the node z, respectively. These nodes will then appear as L(x)[y] and L(z)[y], e.g., NP[1]. The name y is not a label but a placeholder representing an unlabelled node. This representation is used both for rules and for sentence pairs. The sentence pair in Figure 1(b) is then represented as reported in Figure 2.
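As a concrete illustration of this pair-of-extended-trees view, here is a minimal sketch in Python. It is not the authors' implementation and all names are ours: a tDAG is stored as two extended trees whose nodes optionally carry a placeholder index from the shared set A.

```python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class Node:
    """A node of an extended tree: a labelled node L(x), optionally marked with a placeholder."""
    label: str                          # e.g. "NP", "VB", "feed"
    placeholder: Optional[int] = None   # index of the unlabelled node in A, if any
    children: List["Node"] = field(default_factory=list)

@dataclass
class TDAG:
    """A tDAG G = (tau, gamma): two extended trees sharing the placeholder set A."""
    tau: Node    # extended tree over N_t and A (e.g. the text side)
    gamma: Node  # extended tree over N_g and A (e.g. the hypothesis side)

    def placeholders(self) -> set:
        """The set A, recovered from the placeholder marks on both trees."""
        def collect(n: Node, acc: set) -> set:
            if n.placeholder is not None:
                acc.add(n.placeholder)
            for child in n.children:
                collect(child, acc)
            return acc
        return collect(self.tau, set()) | collect(self.gamma, set())
```

Under this representation, the rule ρ1 and the sentence pair (T1, H1) are objects of the same type, and checking whether ρ1 fires on (T1, H1) amounts to the matching problem described above.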
3 Related work

Automatically learning classifiers for sentence pairs is extremely important for applications like textual entailment recognition, question answering, and machine translation.

In textual entailment recognition, it is not hard to see graphs similar to tripartite directed acyclic graphs as ways of extracting features from examples to feed automatic classifiers. Yet, these graphs are generally not tripartite in the sense described in the previous section and they are not used to extract features representing first-order rewrite rules. In [17, 10, 11], two connected graphs representing the two sentences s1 and s2 are used to compute distance features, i.e., features representing the distance between s1 and s2. The underlying idea is that lexical, syntactic, and semantic similarities between the sentences in a pair are relevant features for classifying sentence pairs into classes such as entail and not-entail.

In [7], first-order rewrite rule feature spaces have been explored. Yet, these spaces are extremely small: only a few features representing first-order rules have been explored. Pairs of graphs are used here to determine whether a feature is active or not, i.e., whether the rule fires or not. A larger feature space of rewrite rules has been implicitly explored in [21], but this work considers only ground rewrite rules.

In [22], tripartite directed acyclic graphs are implicitly introduced and exploited to build first-order rule feature spaces. Yet, both in [22] and in [15], the proposed model has two major limitations: it can represent rules with fewer than 7 variables, and the proposed kernel is not a completely valid kernel as it uses the max function.

In machine translation, some methods such as [8] learn graph-based rewrite rules for generative purposes. Yet, the method presented in [8] can model first-order rewrite rules only with a very small number of variables, i.e., two or three variables.

4 An efficient algorithm for computing the first-order rule space kernel

In this section, we present our idea for an efficient algorithm for exploiting first-order rule feature spaces. In Section 4.1, we first define the similarity function, i.e., the kernel K(G1, G2), that we need for correctly using first-order rule feature spaces. This kernel is strongly based on the isomorphism between graphs. A relevant idea of this paper is the observation that we can define an efficient way to detect the isomorphism between tDAGs (Section 4.2). This algorithm exploits efficient algorithms for tree isomorphism, such as the one implicitly used in [3]. After describing the isomorphism between tDAGs, we present the idea of our efficient algorithm for computing K(G1, G2) (Section 4.3). We then introduce the algorithms that make it a viable solution (Section 4.4). Finally, in Section 4.5, we report the kernel computation presented by [22, 15]. The latter is our baseline method.

4.1 Kernel functions over first-order rule feature spaces

The first-order rule feature space we want to model is huge. If we use kernel-based machine learning models such as SVMs [4], we can implicitly define the space by defining its similarity functions, i.e., its kernel functions. We first introduce the first-order rule feature space and we then define the prototypical kernel function over this space.

The first-order rule feature space (FOR) is in general the space of all the possible first-order rules defined as tDAGs. Within this space it is possible to define the function S(G) that determines all the possible active features of the tDAG G in FOR. The function S(G) determines all the possible and meaningful subgraphs of G. We want these subgraphs to represent first-order rules that can be matched with the pair G. Then, meaningful subgraphs of G = (τ, γ) are graphs (t, g) where t and g are subtrees of τ and γ, respectively. For example, the subgraphs of P1 and P2 in Figure 2 are hereafter partially represented:

[Displayed: a partial listing of S(P1) and S(P2) as pairs of extended tree fragments with placeholders, including the fragment pair corresponding to the rule ρ1.]

In the FOR space, the kernel function K should then compute the number of subgraphs in common. The trivial way to describe this kernel function is using the intersection operator, i.e., the kernel K(G1, G2) is the following:

$$K(G_1, G_2) = |S(G_1) \cap S(G_2)| \quad (1)$$

This is very simple to write and it is in principle correct. A graph g in the intersection S(G1) ∩ S(G2) is a graph that belongs to both S(G1) and S(G2). Yet, this hides a very important fact: determining whether two graphs g1 and g2 are the same graph, g1 = g2, is not trivial.
For example, it is not sufficient to superficially compare graphs to determine that ρ1 belongs both to S(P1) and to S(P2). We need to use the correct property for g1 = g2, i.e., the isomorphism between two graphs. We call this operator Iso(g1, g2). When two graphs verify the property Iso(g1, g2), either g1 or g2 can be taken as the graph g representing both. Detecting Iso(g1, g2) has an exponential complexity [13].

This complexity of the intersection operator between sets of graphs calls for a different way to represent the operation. We will use the same symbol but in prefix notation. The operator is hereafter re-defined as:

$$\cap(S(G_1), S(G_2)) = \{g_1 \,|\, g_1 \in S(G_1), \exists g_2 \in S(G_2), Iso(g_1, g_2)\}$$

4.2 Isomorphism between tDAGs

As isomorphism between graphs is an essential activity for learning from structured data, we here review its definition and adapt it to tDAGs. We then observe that the isomorphism between two tDAGs can be divided into two sub-problems:

• finding the isomorphisms between the two pairs of extended trees;

• checking whether the partial isomorphisms found between the two pairs of extended trees are compatible.

In general, two tDAGs G1 = (N1, E1) and G2 = (N2, E2) are isomorphic (or match) if |N1| = |N2|, |E1| = |E2|, and a bijective function f : N1 → N2 exists such that these properties hold:

• for each node n ∈ N1, L(f(n)) = L(n);

• for each edge (n1, n2) ∈ E1, the edge (f(n1), f(n2)) is in E2.

The bijective function f is a member of the combinatorial set F of all the possible bijective functions between the two sets N1 and N2. The trivial algorithm for detecting whether two graphs are isomorphic is exponential [13]: it explores the whole set F. It is still undetermined whether the general graph isomorphism problem is NP-complete. Yet, we can use the fact that tDAGs are pairs of extended trees to build a better algorithm. There is an efficient algorithm for computing isomorphism between trees (such as the one implicitly used in [3]). Given two tDAGs G1 = (τ1, γ1) and G2 = (τ2, γ2), the isomorphism problem can be reduced to the problem of detecting two properties:

1. Partial isomorphism. Two tDAGs G1 and G2 are partially isomorphic if τ1 and τ2 are isomorphic and γ1 and γ2 are isomorphic. The partial isomorphism produces two bijective functions fτ and fγ.

2. Constraint compatibility. Two bijective functions fτ and fγ are compatible on the sets of nodes A1 and A2 if, for each n ∈ A1, it happens that fτ(n) = fγ(n).

We can rephrase the second property, i.e., the constraint compatibility, as follows. We define two constraints c(τ1, τ2) and c(γ1, γ2) representing the functions fτ and fγ on the sets A1 and A2. The two constraints are defined as c(τ1, τ2) = {(n, fτ(n)) | n ∈ A1} and c(γ1, γ2) = {(n, fγ(n)) | n ∈ A1}. Two partially isomorphic tDAGs are isomorphic if the constraints match, i.e., c(τ1, τ2) = c(γ1, γ2).

For example, the fourth pair of S(P1) and the third pair of S(P2) are isomorphic as: (1) they are partially isomorphic, i.e., the τ parts are isomorphic and the γ parts are isomorphic; (2) both pairs of extended trees generate the constraint c1 = {([1],[3]), ([3],[4])}. In the same way, the second pair of S(P1) and the second pair of S(P2) generate the constraint c2 = {([1],[1])}.
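The two-step check above can be summarized with a short sketch in Python. It is only an illustration of the decomposition, not the authors' implementation; `extended_tree_isomorphism` is a hypothetical helper standing in for an efficient tree isomorphism routine (such as the one implicitly used in [3]) that returns a node bijection or None.

```python
def constraint(f, A1):
    """Restrict a node bijection f (a dict) to the placeholder nodes A1: {(n, f(n)) | n in A1}."""
    return frozenset((n, m) for n, m in f.items() if n in A1)

def tdag_isomorphic(G1, G2, A1):
    """Iso(G1, G2) decomposed into partial isomorphism plus constraint compatibility.

    G1 = (tau1, gamma1) and G2 = (tau2, gamma2) are tDAGs seen as pairs of extended
    trees; A1 is the placeholder set of G1. extended_tree_isomorphism is a
    hypothetical helper returning the bijection between two extended trees, or None.
    """
    f_tau = extended_tree_isomorphism(G1[0], G2[0])      # tau1 vs tau2
    f_gamma = extended_tree_isomorphism(G1[1], G2[1])    # gamma1 vs gamma2
    if f_tau is None or f_gamma is None:
        return False                                     # not even partially isomorphic
    # constraint compatibility: c(tau1, tau2) must equal c(gamma1, gamma2)
    return constraint(f_tau, A1) == constraint(f_gamma, A1)
```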
[Figure 3: Intuitive idea for the kernel computation — the set $\cap(\tilde{S}(P_a), \tilde{S}(P_b))|_{c_1}$ written first as a set of pairs of extended trees and then as the cross product $\cap(\tilde{S}(\tau_a), \tilde{S}(\tau_b))|_{c_1} \times \cap(\tilde{S}(\gamma_a), \tilde{S}(\gamma_b))|_{c_1}$.]

[Figure 4: Simple non-linguistic tDAGs Pa and Pb.]

4.3 General idea for an efficient kernel function

As discussed above, two tDAGs are isomorphic if the two properties, partial isomorphism and constraint compatibility, hold. To compute the kernel function K(G1, G2) defined in Section 4.1, we can exploit these properties in the reverse order. Given a constraint c, we can select all the graphs that meet the constraint c (constraint compatibility). Having the set of all the tDAGs meeting the constraint, we can detect the partial isomorphism: we split each pair of tDAGs into the four extended trees and we determine whether these extended trees are compatible.

We introduce this method to compute the kernel K(G1, G2) in the FOR space in two steps. Firstly, we give an intuitive explanation and, secondly, we formally define the kernel.

4.3.1 Intuitive explanation

To give an intuition of the kernel computation, without loss of generality and for the sake of simplicity, we use two non-linguistic tDAGs, Pa and Pb (see Figure 4), and the subgraph function $\tilde{S}(\theta)$. The latter is an approximated version of S(θ) that generates only the tDAGs whose subtrees are rooted in the roots of the initial trees of θ.

To exploit the constraint compatibility property, we define C as the set of all the relevant alternative constraints, i.e., the constraints c that are likely to be generated when detecting the partial isomorphism. For Pa and Pb, this set is C = {c1, c2} = { {([1],[1]), ([2],[2])}, {([1],[1]), ([2],[3])} }. We can then determine the kernel K(Pa, Pb) as:

$$K(P_a, P_b) = |\cap(\tilde{S}(P_a), \tilde{S}(P_b))| = \big|\cap(\tilde{S}(P_a), \tilde{S}(P_b))|_{c_1} \cup \cap(\tilde{S}(P_a), \tilde{S}(P_b))|_{c_2}\big|$$

where $\cap(\tilde{S}(P_a), \tilde{S}(P_b))|_c$ is the set of the common subgraphs that meet the constraint c. A tDAG $g' = (\tau', \gamma')$ in $\tilde{S}(P_a)$ is in $\cap(\tilde{S}(P_a), \tilde{S}(P_b))|_c$ if a $g'' = (\tau'', \gamma'')$ in $\tilde{S}(P_b)$ exists such that $g'$ is partially isomorphic to $g''$ and $c' = c(\tau', \tau'') = c(\gamma', \gamma'')$ is covered by and compatible with the constraint c, i.e., $c' \subseteq c$. For example, in Figure 3, the first tDAG of the set $\cap(\tilde{S}(P_a), \tilde{S}(P_b))|_{c_1}$ belongs to the set as its constraint $c' = \{([1],[1])\}$ is a subset of c1.

Observing the kernel computation in this way is important. Elements in $\cap(\tilde{S}(P_a), \tilde{S}(P_b))|_c$ already satisfy the property of constraint compatibility. We only need to determine whether the partial isomorphism property holds for elements in $\cap(\tilde{S}(P_a), \tilde{S}(P_b))|_c$. Then, we can write the following equivalence:

$$\cap(\tilde{S}(P_a), \tilde{S}(P_b))|_c = \cap(\tilde{S}(\tau_a), \tilde{S}(\tau_b))|_c \times \cap(\tilde{S}(\gamma_a), \tilde{S}(\gamma_b))|_c \quad (2)$$

Figure 3 reports this equivalence for the two sets derived using the constraints c1 and c2. Note that this equivalence is not valid if no constraint is applied, i.e., $\cap(\tilde{S}(P_a), \tilde{S}(P_b)) \neq \cap(\tilde{S}(\tau_a), \tilde{S}(\tau_b)) \times \cap(\tilde{S}(\gamma_a), \tilde{S}(\gamma_b))$. The pair Pa itself does not belong to $\cap(\tilde{S}(P_a), \tilde{S}(P_b))$ but it does belong to $\cap(\tilde{S}(\tau_a), \tilde{S}(\tau_b)) \times \cap(\tilde{S}(\gamma_a), \tilde{S}(\gamma_b))$.
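To make the definition of the constrained intersection concrete, the following naive sketch (Python, for illustration only) filters the common subgraphs by a constraint c exactly as defined above. Here `partial_isomorphism` is a hypothetical helper returning the pair of tree bijections (f_tau, f_gamma) or None, and `constraint` is the restriction to placeholders from the earlier sketch; the efficient algorithm of Section 4.3.2 never enumerates subgraph pairs in this way.

```python
def constrained_intersection(S_a, S_b, c, A1):
    """The set of common subgraphs of Pa and Pb whose induced constraint is covered by c.

    S_a and S_b are explicit lists of subgraphs (pairs of extended trees);
    c is a frozenset of placeholder pairs; A1 is the placeholder set of Pa.
    """
    selected = []
    for g1 in S_a:
        for g2 in S_b:
            maps = partial_isomorphism(g1, g2)            # hypothetical helper
            if maps is None:
                continue                                  # not partially isomorphic
            f_tau, f_gamma = maps
            c_prime = constraint(f_tau, A1)
            # both sides must induce the same constraint, and it must be covered by c
            if c_prime == constraint(f_gamma, A1) and c_prime <= c:
                selected.append(g1)
                break
    return selected
```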
The equivalence (2) allows us to compute the cardinality of $\cap(\tilde{S}(P_a), \tilde{S}(P_b))|_c$ using the cardinalities of $\cap(\tilde{S}(\tau_a), \tilde{S}(\tau_b))|_c$ and $\cap(\tilde{S}(\gamma_a), \tilde{S}(\gamma_b))|_c$. These latter sets contain only extended trees where the equivalences between unlabelled nodes are given by c. We can then compute the cardinalities of these two sets using methods developed for trees (e.g., the kernel function KS(θ1, θ2) introduced in [3]).

4.3.2 Formal definition

Given the idea of the previous section, it is easy to demonstrate that the kernel K(G1, G2) can be written as follows:

$$K(G_1, G_2) = \Big|\bigcup_{c \in C} \cap(S(\tau_1), S(\tau_2))|_c \times \cap(S(\gamma_1), S(\gamma_2))|_c\Big|$$

where C is the set of alternative constraints and $\cap(S(\theta_1), S(\theta_2))|_c$ is the set of all the common extended trees compatible with the constraint c. We can compute the above kernel using the inclusion-exclusion property, i.e.,

$$|A_1 \cup \dots \cup A_n| = \sum_{J \in 2^{\{1,\dots,n\}}} (-1)^{|J|-1} |A_J| \quad (3)$$

where $2^{\{1,\dots,n\}}$ is the set of all the non-empty subsets of $\{1, \dots, n\}$ and $A_J = \bigcap_{i \in J} A_i$.

To describe the application of the inclusion-exclusion model to our case, let us first define:

$$K_S(\theta_1, \theta_2, c) = |\cap(S(\theta_1), S(\theta_2))|_c| \quad (4)$$

where θ1 can be either τ1 or γ1 and θ2 can be either τ2 or γ2. Trivially, we can demonstrate that:

$$K(G_1, G_2) = \sum_{J \in 2^{\{1,\dots,|C|\}}} (-1)^{|J|-1} K_S(\tau_1, \tau_2, c(J))\, K_S(\gamma_1, \gamma_2, c(J)) \quad (5)$$

where $c(J) = \bigcap_{i \in J} c_i$.

Given the nature of the constraint set C, we can compute the previous equation efficiently, as it often happens that two different J1 and J2 in $2^{\{1,\dots,|C|\}}$ generate the same c, i.e.,

$$c = \bigcap_{i \in J_1} c_i = \bigcap_{i \in J_2} c_i \quad (6)$$

Then, we can define C* as the set of all intersections of constraints in C, i.e., $C^* = \{c(J) \,|\, J \in 2^{\{1,\dots,|C|\}}\}$. We can rewrite the equation as:

$$K(G_1, G_2) = \sum_{c \in C^*} K_S(\tau_1, \tau_2, c)\, K_S(\gamma_1, \gamma_2, c)\, N(c) \quad (7)$$

where

$$N(c) = \sum_{\substack{J \in 2^{\{1,\dots,|C|\}} \\ c = c(J)}} (-1)^{|J|-1} \quad (8)$$

The complexity of the above kernel strongly depends on the cardinality of C and on the related cardinality of C*. The worst-case computational complexity is still exponential with respect to the size of A1 and A2. Yet, the average-case complexity [20] is promising. The set C is generally very small with respect to the worst case. If F(A1,A2) is the set of all the possible correspondences between the nodes of A1 and A2, it happens that |C| ≪ |F(A1,A2)|, where |F(A1,A2)| is the worst case. For example, in the case of P1 and P2, the cardinality of C = { {([1],[1])}, {([1],[3]), ([3],[4]), ([2],[5])} } is extremely smaller than that of F(A1,A2) = { {([1],[1]), ([2],[2]), ([3],[3])}, {([1],[2]), ([2],[1]), ([3],[3])}, {([1],[2]), ([2],[3]), ([3],[1])}, ..., {([1],[3]), ([2],[4]), ([3],[5])} }. In Section 4.5 we argue that the algorithm presented in [15] has this worst-case complexity. Moreover, the set C* is extremely smaller than $2^{\{1,\dots,|C|\}}$ due to the above property (6). We will analyze the average-case complexity with respect to the worst-case complexity in Section 5.
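A minimal sketch of equations (5)–(8) follows, assuming an available routine KS(θ1, θ2, c) for the constrained tree kernel of equation (4) (e.g., the tree-fragment kernel of [3] adapted to placeholders). The sketch enumerates the subsets J explicitly, so it is only illustrative; Sections 4.4.2 and 4.4.3 describe how C* and N(c) are obtained without this exhaustive enumeration.

```python
from itertools import combinations

def kernel_FOR(tau1, gamma1, tau2, gamma2, C, KS):
    """K(G1, G2) as in equation (7): group the inclusion-exclusion terms of (5) by c(J).

    C is the list of alternative constraints (each a frozenset of placeholder pairs);
    KS(theta1, theta2, c) is an assumed routine counting the common extended trees
    compatible with the constraint c, as in equation (4).
    """
    N = {}  # c -> N(c), the signed multiplicity of equation (8)
    for r in range(1, len(C) + 1):
        for J in combinations(range(len(C)), r):
            c = frozenset.intersection(*(C[i] for i in J))   # c(J)
            N[c] = N.get(c, 0) + (-1) ** (len(J) - 1)
    # equation (7): one KS product per distinct constraint in C*
    return sum(KS(tau1, tau2, c) * KS(gamma1, gamma2, c) * n for c, n in N.items())
```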
4.4 Enabling the efficient kernel function

The above idea for computing the kernel function is extremely interesting. Yet, we need to make it viable by describing how to efficiently determine the three main parts of equation (7): 1) the set of alternative constraints C (Section 4.4.1); 2) the set C* of all the possible intersections of constraints in C (Section 4.4.2); and, finally, 3) the numbers N(c) (Section 4.4.3).

4.4.1 Determining the set of alternative constraints

The first step of equation (7) is to determine the alternative constraints C. Here we strongly exploit the possibility of dividing tDAGs into two trees. We build C as Cτ ∪ Cγ where: 1) Cτ is the set of constraints obtained from pairs of isomorphic extended trees t1 ∈ S(τ1) and t2 ∈ S(τ2); 2) Cγ is the set of constraints obtained from pairs of isomorphic extended trees t1 ∈ S(γ1) and t2 ∈ S(γ2).

The idea for an efficient algorithm is that we can compute C without explicitly looking at all the subgraphs involved. We instead use and combine the constraints derived by comparing the productions of the extended trees: we can compute Cτ from the productions of τ1 and τ2 and Cγ from the productions of γ1 and γ2. For example (see Figure 2), focusing on the τ side, the production NP[3] → NN[2] NNS[3] of G1 and the production NP[4] → NN[5] NNS[4] of G2 generate the constraint c = {([3],[4]), ([2],[5])}.

Using the above intuition it is possible to define an algorithm that builds an alternative constraint set C with the following two properties:

1. for each common subtree matched according to a set of constraints c, there exists c′ ∈ C such that c ⊆ c′;

2. there are no c′, c″ ∈ C such that c′ ⊂ c″ and c′ ≠ ∅.

4.4.2 Determining the set C*

The set C* is defined as the set of all possible intersections of alternative constraints in C. Due to the property (6) discussed in Section 4.3, we can empirically demonstrate that the average complexity of the algorithm for computing C* is not bigger than O(|C|²). Yet, again, the worst-case complexity is exponential.

4.4.3 Determining the values of N(c)

The multiplier N(c) (Eq. 8) represents the number of times the constraint c is considered in the sum of equation (5), taking into account the sign of the corresponding addend. It is possible to demonstrate that:

$$N(c) = 1 - \sum_{\substack{c' \in C^* \\ c' \supset c}} N(c') \quad (9)$$

This recursive formulation allows us to easily determine the value of N(c) for every c belonging to C*. It is possible to prove this property using set properties and the binomial theorem. The proof is omitted for lack of space.

4.5 Reviewing the strictly related work

To understand whether ours is an efficient algorithm, we compare it with the algorithm presented by [15]. We will hereafter call this algorithm Kmax. The Kmax kernel is an approximation of the kernel needed for a FOR space, as it is not difficult to demonstrate that Kmax(G1, G2) ≤ K(G1, G2). The Kmax approximation is based on a maximization over the set of possible correspondences of the placeholders. Following our formulation, this kernel appears as:

$$K_{max}(G_1, G_2) = \max_{c \in F_{(A_1,A_2)}} K_S(\tau_1, \tau_2, c)\, K_S(\gamma_1, \gamma_2, c) \quad (10)$$

where F(A1,A2) is the set of all the possible correspondences between the nodes A1 and A2 of the two tDAGs, as presented in Section 4.3. This formulation of the kernel has the worst-case complexity of our formulation, i.e., of Eq. (7).

For computing the basic kernel for the extended trees, i.e., KS(θ1, θ2, c), we use the algorithm presented by [22] and refined by [15], based on the algorithm for tree fragment feature spaces [3]. As we are using the same basic kernel, we can empirically compare the two methods.
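For comparison, here is a sketch of the baseline of equation (10), under the same assumption of an available KS routine as in the previous sketch; A1 and A2 are the placeholder sets and, for simplicity, the sketch assumes |A1| ≤ |A2|. The explicit enumeration of F(A1,A2) is exactly what makes Kmax exponential in the number of placeholders.

```python
from itertools import permutations

def kernel_max(tau1, gamma1, tau2, gamma2, A1, A2, KS):
    """K_max(G1, G2) as in equation (10): maximize over all placeholder correspondences.

    Assumes |A1| <= |A2|; KS(theta1, theta2, c) is the same assumed constrained
    tree-kernel routine used for K in the previous sketch.
    """
    best = 0
    for image in permutations(sorted(A2), len(A1)):   # injective maps from A1 into A2
        c = frozenset(zip(sorted(A1), image))         # one correspondence in F_(A1,A2)
        best = max(best, KS(tau1, tau2, c) * KS(gamma1, gamma2, c))
    return best
```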
5 Experimental evaluation

In this section we want to empirically estimate the benefits, in terms of computational cost, of our algorithm with respect to the algorithm proposed by [15]. Our algorithm is in principle exponential with respect to the set of alternative constraints C. Yet, given the ideas in Section 4.4 and as the set C* is usually very small, the average complexity is extremely low. Following the theory of average-case computational complexity [20], we estimated the behavior of the algorithms on a large distribution of cases and compared the computing times of the two algorithms. Finally, as K and Kmax compute slightly different kernels, we also compare the accuracy of the two methods.

We implemented both algorithms, K(G1, G2) and Kmax(G1, G2), in a support vector machine classifier [12] and we ran both implementations on the same machine. We hereafter analyze the results in terms of execution time (Section 5.1) and in terms of accuracy (Section 5.2).

5.1 Average computing time analysis

For the first set of experiments, the source of examples is the Recognizing Textual Entailment challenge, i.e., RTE2 [1]. The dataset of the challenge has 1,600 sentence pairs.

The computational cost of both K(G1, G2) and Kmax(G1, G2) depends on the number of placeholders n = |A1| of G1 and on the number of placeholders m = |A2| of G2. Then, in the first experiment we want to determine the relation between the computational time and the factor n × m. Results are reported in Figure 5(a), where the computation times are plotted with respect to n × m.

[Figure 5: Comparison of the execution times. (a) Mean execution time in milliseconds (ms) of the two algorithms with respect to n × m, where n and m are the numbers of placeholders of the two tDAGs. (b) Total execution time in seconds (s) of the training phase on RTE2 with respect to different numbers of allowed placeholders.]

Each point in the curve represents the average execution time for the pairs of instances having n × m placeholders. As expected, the computation of the function K is more efficient than the computation of Kmax. The difference between the two execution times increases with n × m.

We then performed a second experiment that determines the relation between the total execution time and the maximum number of placeholders in the examples. This is useful to estimate the behavior of the algorithm with respect to its application in learning models. Using the RTE2 data, we artificially built different versions with increasing numbers of placeholders: RTE2 with at most one placeholder in each pair, RTE2 with at most two placeholders, etc. The number of pairs in each set is the same; what changes is the maximal number of placeholders. Results are reported in Figure 5(b), where the execution time of the training phase in seconds (s) is plotted for each different set. We see that the computation of Kmax is exponential with respect to the number of placeholders and it becomes intractable beyond 7 placeholders. The growth of the computation of K is instead much flatter. This can be explained by the fact that the computation of K is related to the actual alternative constraints that appear in the dataset. The computation of the kernel K therefore outperforms the computation of the kernel Kmax.

5.2 Accuracy analysis

As Kmax, which has been demonstrated to be very effective in terms of accuracy for RTE, and K compute slightly different similarity functions, we want to show that the performance of our more computationally efficient K is comparable to, and even better than, the performance of Kmax.
We then performed an experiment taking as training data all the examples derived from RTE1, RTE2, and RTE3 (i.e., 4,567 training examples) and taking RTE4 as test data (i.e., 1,000 testing examples). The results are reported in Table 1.

Kernel   Accuracy   Used training examples   Support vectors
Kmax     59.32      4223                      4206
K        60.04      4567                      4544

Table 1: Comparative performances of Kmax and K

As the table shows, the accuracy of K is higher than the accuracy of Kmax. There are two main reasons. The first is that Kmax is an approximation of K. The second is that we can now consider sentence pairs with more than 7 placeholders and, then, use the complete training set, as the third column of the table shows.

6 Conclusions and future work

We presented an interpretation of first-order rule feature spaces as tripartite directed acyclic graphs (tDAGs). This view on the problem gave us the possibility of defining a novel and efficient algorithm for computing the kernel function for first-order rule feature spaces. Moreover, the resulting algorithm is a valid kernel as it can be written as a dot product in the explicit space of the tDAG fragments. We demonstrated that our algorithm outperforms the previous algorithm in terms of average complexity and that it yields better accuracy on the final task. We are investigating whether this is a valid algorithm for two general directed acyclic graphs.

References

[1] R. Bar-Haim, I. Dagan, B. Dolan, L. Ferro, D. Giampiccolo, B. Magnini, and I. Szpektor. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, Venice, Italy, 2006.

[2] B. Carpenter. The Logic of Typed Feature Structures. Cambridge University Press, Cambridge, England, 1992.

[3] M. Collins and N. Duffy. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of ACL02, 2002.

[4] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1–25, 1995.

[5] I. Dagan and O. Glickman. Probabilistic textual entailment: Generic applied modeling of language variability. In Proceedings of the Workshop on Learning Methods for Text Understanding and Mining, Grenoble, France, 2004.

[6] I. Dagan, O. Glickman, and B. Magnini. The PASCAL recognising textual entailment challenge. In J. Quiñonero-Candela et al., editors, LNAI 3944: MLCW 2005, pages 177–190, Milan, Italy, 2006. Springer-Verlag.

[7] M.-C. de Marneffe, B. MacCartney, T. Grenager, D. Cer, A. Rafferty, and C. D. Manning. Learning to distinguish valid textual entailments. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, Venice, Italy, 2006.

[8] J. Eisner. Learning non-isomorphic tree mappings for machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Companion Volume, pages 205–208, Sapporo, July 2003.

[9] T. Gärtner. A survey of kernels for structured data. SIGKDD Explorations, 2003.

[10] A. D. Haghighi, A. Y. Ng, and C. D. Manning. Robust textual inference via graph matching. In HLT '05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 387–394, Morristown, NJ, USA, 2005. Association for Computational Linguistics.

[11] A. Hickl, J. Williams, J. Bensley, K. Roberts, B. Rink, and Y. Shi. Recognizing textual entailment with LCC's GROUNDHOG system. In B. Magnini and I. Dagan, editors, Proceedings of the Second PASCAL Recognizing Textual Entailment Challenge, Venice, Italy, 2006. Springer-Verlag.
[12] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning. MIT Press, 1999.

[13] J. Köbler, U. Schöning, and J. Torán. The Graph Isomorphism Problem: Its Structural Complexity. Birkhäuser Verlag, Basel, Switzerland, 1993.

[14] A. Moschitti. A study on convolution kernels for shallow semantic parsing. In Proceedings of the ACL, Barcelona, Spain, 2004.

[15] A. Moschitti and F. M. Zanzotto. Fast and effective kernels for relational learning from texts. In Proceedings of the International Conference on Machine Learning (ICML), Corvallis, Oregon, 2007.

[16] C. Pollard and I. Sag. Head-Driven Phrase Structure Grammar. Chicago CSLI, Stanford, 1994.

[17] R. Raina, A. Haghighi, C. Cox, J. Finkel, J. Michels, K. Toutanova, B. MacCartney, M.-C. de Marneffe, C. D. Manning, and A. Y. Ng. Robust textual inference using diverse knowledge sources. In Proceedings of the 1st PASCAL Challenge Workshop, Southampton, UK, 2005.

[18] J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. In First International Workshop on Mining Graphs, Trees and Sequences, 2003.

[19] J. Suzuki, T. Hirao, Y. Sasaki, and E. Maeda. Hierarchical directed acyclic graph kernel: Methods for structured natural language data. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 32–39, 2003.

[20] J. Wang. Average-case computational complexity theory, pages 295–328, 1997.

[21] R. Wang and G. Neumann. Recognizing textual entailment using a subsequence kernel method. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence (AAAI-07), July 22–26, Vancouver, Canada, 2007.

[22] F. M. Zanzotto and A. Moschitti. Automatic learning of textual entailments with cross-pair similarities. In Proceedings of the 21st COLING and 44th ACL, pages 401–408, Sydney, Australia, July 2006.