<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Compositional Perspective in Convolution Kernels</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roberto Basili</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Annesi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Castellucci</string-name>
          <email>castellucci@ing.uniroma2.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danilo Croce</string-name>
          <email>croceg@info.uniroma2.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electronic Engineering, University of Roma Tor Vergata</institution>
          ,
          <addr-line>Roma</addr-line>
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Enterprise Engineering, University of Roma Tor Vergata</institution>
          ,
          <addr-line>Roma</addr-line>
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Kernel-based learning has been largely adopted in many semantic textual inference tasks. In particular, Tree Kernels (TKs) have been successfully applied in the modeling of syntactic similarity between linguistic instances in Question Answering or Information Extraction tasks. At the same time, lexical semantic information has been studied through the adoption of the so-called Distributional Semantics (DS) paradigm, where lexical vectors are acquired automatically from large-scale corpora. Recently, Compositional Semantics phenomena arising in complex linguistic structures have been studied in an extended paradigm called Distributional Compositional Semantics (DCS), where, for example, algebraic operators on lexical vectors have been defined to account for grammatically typed bi-grams or complex verb or noun phrases. In this paper, a novel kernel called Compositionally Smoothed Partial Tree Kernel is presented to integrate DCS operators into the tree kernel evaluation by also considering complex compositional nodes. Empirical results on well-known NLP tasks show that state-of-the-art performances can be achieved, without resorting to manual feature engineering, thus suggesting that a large set of Web and text mining tasks can be handled successfully by this kernel.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Convolution Kernels [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] are well-known similarity functions among complex
syntactic and semantic structures. They are largely used to solve natural language related
tasks, such as Opinion Mining [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] in recommender systems, Semantic Textual
Similarity [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] in retrieval, or Question Classification [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] in question answering systems.
In particular, Tree Kernels (TKs) introduced in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] are used for their ability to capture
syntactic information directly from syntactic parse trees. When training supervised
learning algorithms, they are a valid approach to avoid the difficulty of manually
designing effective features from linguistic structures. They define a similarity measure
between data points (i.e. trees) in terms of all possible shared substructures.
      </p>
      <p>
        In order to take into account the lexical generalization provided by the analysis
of large-scale document collections, i.e. Distributional Semantic Models [
        <xref ref-type="bibr" rid="ref19 ref25 ref26">19, 26, 25</xref>
        ],
a recent formulation of these kernel functions, called Smoothed Partial Tree Kernel
(SPTK), has been introduced in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The semantic information related to the lexical
nodes of a parse tree is exploited for matching sub-structures characterized by different
but (semantically) related lexical items: the similarity function between
sub-trees thus also depends on a distributional metric between vectors representing the
semantics of words. The main limitations of this approach are that a) lexical semantic
information relies only on the vector metrics applied to the nodes in a context-free fashion
and b) the semantic composition between words is neglected in the kernel computation,
which depends only on their grammatical labels.
      </p>
      <p>
        In our perspective, semantic information is expressed by all nodes of the parse tree
where specific compositions between heads and modifiers are shown. For example, in
the sentence “What instrument does Hendrix play?”, the correct meaning of the verb
play is fully captured only if its modifier, i.e. the noun instrument, is taken into account.
This corresponds to modeling the contribution of the lexical item instrument as a function
of the corresponding direct object relation (dobj) between play and instrument: we
can say that instrument contributes to the sentence semantics by compositionally filling
the slot dobj of its head play. In other words, lexical semantic information
should propagate over the entire parse tree, making the compositionality phenomena
of the sentence explicit. In recent years, Distributional Compositional Semantics (DCS)
metrics have been proposed in order to combine lexical vector representations through
algebraic operators defined in the corresponding distributional space [
        <xref ref-type="bibr" rid="ref13 ref21 ref32 ref4">21, 13, 32, 4</xref>
        ].
      </p>
      <p>
        The purpose of this paper is to cast the use of words within the compositional
relationships exposed by a tree, in order to make the kernel sensitive to the underlying complex
constituents. Even when the syntactic structure does not change, such as in “play an
instrument” vs. “play a match”, the contribution of the lexical information (e.g. the
semantics of the verb “play”) must be correspondingly different when estimating the
kernel function against a phrase like “win a game”. In traditional kernel computations,
the similarity between play and win is context-free and the information about the direct
object is neglected. The ideas underlying the Compositionally Smoothed Partial Tree
Kernel (CSPTK), proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], are to i) use the SPTK formulation in order to exploit
the lexical information encoded in a tree, ii) define a procedure to encode head-modifier
information in the parse tree nodes in order to estimate the lexical similarity between
sub-trees, iii) apply compositional distributional metrics to enforce a context-dependent
estimation of the similarity of individual head-modifier information at the nodes.
      </p>
      <p>In Section 2 a summary of the approaches for DCS and TKs is presented. In Section
3, a lexical mark-up method for parse trees is described to support the CSPTK
formulation, presented in Section 4. Finally, Section 5 reports the empirical evaluation of the
CSPTK over Question Classification and Paraphrase Identification tasks.
</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>This section reports a summary of the approaches for Distributional Compositional
Semantics and Convolution Kernels.</p>
      <p>
        Distributional Compositional Semantics. Vector-based models typically represent
isolated words and ignore grammatical structure [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] (a noticeable exception is [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]). They
have thus a limited capability to model compositional operations over phrases and
sentences. Distributional methods have been recently extended to better account for
compositionality, in the so-called Distributional Compositional Semantics (DCS) approaches.
Existing models are still controversial and provide general algebraic operators over
lexical vectors and sophisticated composition methods. In [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] compositionality of two
vector u and v is accounted by the tensor product u v, while in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] lexical
vectors are summed, keeping the resulting vector with the same dimension of the latter. In
[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] a semantic composition is seen as a function p = f (u; v; R; K) of two vectors u
and v, their syntactic relation R and the factor K, defined as any additional knowledge
or information which is needed to construct the semantics of their composition. This
perspective clearly leads to a variety of efficient yet shallow models of compositional
semantics. Two simplified models are derived from these general forms: i) the additive
model p+ = u + v, where the weights and are applied the lexical vectors; ii)
the multiplicative model p = u v, where the symbol represents the point-wise
multiplication, i.e. pi = ui vi. In [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] a more complex asymmetric type of function
called dilation, i.e. pd = (u u)v + ( 1)(u v)u, is introduced.
      </p>
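      <p>To make these operators concrete, the following minimal Python sketch implements the
additive, multiplicative and dilation models over dense numpy vectors; the parameter defaults
(α = β = 0.5, λ = 2) are illustrative choices, not values prescribed by the cited works.</p>
      <p>import numpy as np

def additive(u, v, alpha=0.5, beta=0.5):
    # p+ = alpha*u + beta*v: a weighted sum, same dimensionality as u and v
    return alpha * u + beta * v

def multiplicative(u, v):
    # point-wise multiplication: p_i = u_i * v_i
    return u * v

def dilation(u, v, lam=2.0):
    # p_d = (u.u)v + (lam - 1)(u.v)u: stretches v along the direction of u
    return u.dot(u) * v + (lam - 1.0) * u.dot(v) * u</p>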
      <p>
        In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the concept of a structured vector space is introduced, where each word
is associated with a set of vectors corresponding to different syntactic dependencies.
Every word is expressed by a tensor, and tensor operations are imposed. In [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] a similar
geometric approach over a space of second-order distributional vectors is presented.
Vectors represent the words typically co-occurring with the contexts in which the target
word appears. A parallel strand of research also seeks to represent the meaning of larger
compositional structures using matrix and tensor algebra, like in [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] and in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] composition is characterized, following formal semantics, in terms of function
application, where the distributional representation of one element in a composition (the
functor) is not a vector but a function. Moreover, a hybrid Logic-Distributional Model
is presented in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], whilst an approach based on vector permutation and the Random
Indexing technique [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] is presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] composition is expressed via a
projection operation into a subspace, i.e. a subset of the original features of a
syntactically motivated compound. A projection is a mapping (a selection function) over the set
of features most representative of the compound: a subspace local to the (play,guitar)
phrase can be found such that only the features specific to its meaning are selected. The
resulting subspace of a compound should thus select the most appropriate word senses,
neglecting the irrelevant ones and preserving the compositional semantics of the phrase.
Convolution Tree Kernels. Convolution Kernels, introduced in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], determine a
similarity function among discrete structures, such as sequences, strings or trees. They
are representationally efficient ways to encode similarity metrics able to support
supervised learning algorithms for complex textual inferences (e.g. semantic role
classification). In particular, Tree Kernels (TKs) allow estimating the similarity among
texts directly from the sentence syntactic structures, represented by trees, e.g.
dependency parse trees. A TK computes the number of substructures (as well as their
partial fragments) shared by two parse trees T1 and T2. For this purpose, let the set
F = {f1, f2, …, f|F|} be a space of tree fragments and χi(n) be an indicator function:
it is 1 if the fragment fi is rooted at node n and 0 otherwise. A tree kernel is a function
TK(T1, T2) = Σ_{n1 ∈ N_{T1}} Σ_{n2 ∈ N_{T2}} Δ(n1, n2), where N_{T1} and N_{T2} are the sets of
T1's and T2's nodes and Δ(n1, n2) = Σ_{i=1..|F|} χi(n1) χi(n2). The Δ function recursively
computes the amount of similarity derived from the number of common substructures.
The type of considered fragments determines the expressiveness of the kernel space, and
different tree kernels are characterized by different choices. In particular, the Smoothed
Partial Tree Kernel, as discussed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], measures the similarity between syntactic
tree structures that are semantically related, i.e. partially similar, even when nodes,
e.g. words at the leaves, differ. This is achieved by the following formulation of the Δ
function over nodes ni ∈ Ti:

Δσ(n1, n2) = μ λ σ(n1, n2), if n1 and n2 are leaves; otherwise
Δσ(n1, n2) = μ σ(n1, n2) (λ² + Σ_{I1, I2 : l(I1)=l(I2)} λ^{d(I1)+d(I2)} Π_{j=1..l(I1)} Δσ(c_{n1}(I1j), c_{n2}(I2j)))   (1)
      </p>
      <p>Here, I1 and I2 range over subsequences of the children of n1 and n2 respectively
(I1j denoting the j-th element), l(·) is their length, d(·) their spread, and c_n(i) the i-th child
of n: all other non-matching substructures are neglected. Two decay factors, μ and λ, are
introduced, respectively for the height of the tree and for the length of the child sequences.
The novelty of the SPTK is represented by the introduction of a similarity function σ
between nodes, which are typed according to syntactic categories, POS tags or lexical
entries. A boolean measure of similarity is applied between non-lexical nodes,
assigning 1 to identical syntactic or POS nodes and 0 otherwise. For lexical nodes, the cosine
similarity is applied between words sharing the same POS tag. One main limitation
of the SPTK is that lexical similarity does not consider the compositional interaction between
words. Given the phrase pairs (np (nn river)(nn bank)) and (np (nn savings)(nn bank)),
the SPTK would estimate the similarity by relying on a unique meaning for bank, which
is wrong whenever one considers the compositional role of the modifiers, river vs. savings.</p>
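      <p>As a rough illustration of how σ enters the recursion, the following Python sketch computes
a drastically simplified Δσ: unlike the true SPTK of Equation 1, which sums over all pairs of child
subsequences, it only scores the full alignment of same-length child sequences. The Node attributes
(is_lexical, label, pos, lemma, children) and the vectors dictionary are assumed containers for this
sketch, not an API from the cited works.</p>
      <p>import numpy as np

MU, LAMBDA = 0.4, 0.4  # decay factors for tree height and child-sequence length

def sigma(n1, n2, vectors, lex_weight=1.0):
    # Node similarity: strict match on syntactic/POS labels; cosine between
    # word vectors for lexical nodes sharing the same POS tag.
    if n1.is_lexical and n2.is_lexical:
        if n1.pos != n2.pos:
            return 0.0
        u, v = vectors.get(n1.lemma), vectors.get(n2.lemma)
        if u is None or v is None:
            return 1.0 if n1.lemma == n2.lemma else 0.0
        return float(u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return lex_weight if n1.label == n2.label else 0.0

def delta(n1, n2, vectors):
    # Simplified Delta_sigma: only the full, same-length child alignment is scored.
    s = sigma(n1, n2, vectors)
    if s == 0.0:
        return 0.0
    if not n1.children and not n2.children:
        return MU * LAMBDA * s
    if len(n1.children) != len(n2.children):
        return MU * s * LAMBDA ** 2
    prod = 1.0
    for c1, c2 in zip(n1.children, n2.children):
        prod *= delta(c1, c2, vectors)
    return MU * s * (LAMBDA ** 2 + prod)

def tree_kernel(nodes1, nodes2, vectors):
    # TK(T1, T2): sum of Delta_sigma over all node pairs of the two trees
    return sum(delta(a, b, vectors) for a in nodes1 for b in nodes2)</p>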
    </sec>
    <sec id="sec-3">
      <title>Making lexical composition explicit in parse trees</title>
      <p>In this Section, a way of representing compositional information is discussed as a
labeling of individual nodes in the tree. The result is an enriched representation, on which
Distributional Compositional Operators can be applied while evaluating the SPTK
function, i.e. resulting in a Compositionally Smoothed Partial Tree Kernel (CSPTK).</p>
      <p>
        Compositional semantic constraints over a tree kernel computation can be applied
only when individual syntagms corresponding to nodes are made explicit. The CSPTK
takes as input two trees where nodes express a lexical composition: this is used in the
recursive compositional matching foreseen by the underlying convolution model, i.e.
the same as in the SPTK. Given the question “What instrument does Hendrix play?” and
its dependency structure, the corresponding syntactic representation is shown in Figure 1 in terms
of a Lexically Centered Tree (LCT), as proposed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Nodes in LCTs can be partitioned into: lexical nodes, i.e. nodes n ∈ NT,
each representing one lexical item in terms of ⟨ln::posn⟩, such as instrument::n or play::v, where
l is the lemma of the token and pos its part-of-speech (general POS tags are obtained from the
PennTreebank standard by truncating at their first character, as in instrument::n); terminal nodes,
n ∈ T, that are the children of lexical nodes and encode either a dependency function d ∈ D (e.g.
nsubj) or the POS tag of the parent node (e.g. NN).</p>
      <p>An additional tree can be derived by emphasizing the importance of syntactic
information in the so-called Grammatical Relation Centered Tree (GRCT) in Figure 3,
where nodes are partitioned into: syntactic nodes, i.e. nodes n ∈ NT, that encode
dependency functions d ∈ D (e.g. nsubj); pre-terminal nodes, n ∈ PT, that encode the
POS tag of each lexical item that is assigned to a leaf; lexical nodes, i.e. nodes n ∈ T,
each representing one lexical item in terms of ⟨ln::posn⟩.</p>
      <p>We aim at introducing lexical compositionality in these syntactic representations,
through the explicit mark-up of every compositional (sub)structure. Notice how
grammatical dependencies in an LCT representation are encoded as direct descendants of a
modifier non-terminal. For example, the dependency dobj is a direct descendant of the
lexical node instrument::n and expresses its grammatical role in the relationship with
its parent node play::v. We propose to mark up such lexical nodes with the full
description of the corresponding head/modifier relationship, denoted by (h, m). In order to
emphasize the semantic contribution of this relationship, the lexical information about
the involved head (lh) and modifier (lm) must be represented: we denote it through a
4-tuple ⟨lh::posh, lm::posm⟩. Therefore, in an extended (compositional) representation
for LCTs, every non-terminal n ∈ NT is marked as</p>
      <p>
        ⟨dh,m, ⟨lh::posh, lm::posm⟩⟩   (2)
where dh,m ∈ D is the dependency function between h and m, and li and posi are the
corresponding lexical entries and POS tags. Moreover, one lexical node is also added to represent
the simple lexical information of the involved modifier, in order to limit data sparseness.
Notice that this mark-up closely resembles the representation of immediate dominance
rules for grammatical phrases in Attribute Value Matrices (AVMs) expressing feature
structures in HPSG-like formalisms [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
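      <p>A minimal sketch of this mark-up, assuming dependency arcs are already available as
(dependency, head, modifier) tuples; the container and function names are illustrative.</p>
      <p>from collections import namedtuple

# Hypothetical container for a dependency arc between a head and a modifier
Arc = namedtuple("Arc", "dep head_lemma head_pos mod_lemma mod_pos")

def compositional_label(arc):
    # Builds the Eq. 2 node label ⟨d_{h,m}, ⟨l_h::pos_h, l_m::pos_m⟩⟩
    return "%s⟨%s::%s,%s::%s⟩" % (
        arc.dep, arc.head_lemma, arc.head_pos, arc.mod_lemma, arc.mod_pos)

print(compositional_label(Arc("dobj", "play", "v", "instrument", "n")))
# -> dobj⟨play::v,instrument::n⟩</p>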
      <p>Figure 2 shows a fully compositionally labeled tree for the sentence whose
unlabeled version has been shown in Figure 1. Now, non-terminal nodes represent
compositional lexical pairs of type head/modifier, marked as in Eq. 2. For example, the node
dobj⟨play::v,instrument::n⟩ expresses the direct object relation between the verb to play
and the object instrument, as shown in Fig. 2. Dependency functions (dobj) and
POS tags (VBZ) are encoded in terminal nodes as in the LCT before. Eventually, simple
lexical nodes, e.g. play::v, are added as terminal nodes.
In a similar fashion, the GRCT in Figure 3 can be extended into its Compositional
counterpart, i.e. the CGRCT in Figure 4. In this case non-terminal nodes represent
compositional lexical pairs of type head/modifier marked as in Eq. 2; POS tags are
encoded in pre-terminal nodes as in the GRCT before; finally, simple lexical nodes,
e.g. play::v, are preserved as terminal nodes.</p>
      <p>[Figure 2: the compositionally labeled LCT (CLCT) for “What instrument does Hendrix play?”, with compositional nodes such as root⟨play::v,*::*⟩, dobj⟨play::v,instrument::n⟩, aux⟨play::v,do::v⟩, nsubj⟨play::v,Hendrix::n⟩ and det⟨instrument::n,what::w⟩.]</p>
      <p>Every compositional node supports the
application of a compositional distributional semantic model, where lexical entries for
heads (lh::posh) and modifiers (lm::posm) correspond to unique vectors. Given two
subtrees T1 and T2, rooted at nodes n1 and n2, and the corresponding head-modifier
pairs p1 = (h1, m1) and p2 = (h2, m2), a shallow compositional function, independent
of any dependency relation dh,m, is defined over non-terminal nodes by adopting one
of the DCS models discussed in Section 2, so that:</p>
      <p>σComp((h1, m1), (h2, m2)) = σ(∘(h1, m1), ∘(h2, m2))   (3)
where ∘ is the adopted DCS composition operator, mapping a head-modifier pair into a single
vector, and σ is the similarity (e.g. the cosine) between the two resulting vectors.</p>
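      <p>The following sketch makes Equation 3 operational, assuming a lexical vector lookup and
reusing a composition operator such as the additive model sketched in Section 2; the function
names are illustrative.</p>
      <p>import numpy as np

def cosine(u, v):
    return float(u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sigma_comp(p1, p2, vectors, compose=np.add):
    # Eq. 3: compose each (head, modifier) pair into one vector with a DCS
    # operator, then compare the two resulting compound vectors.
    (h1, m1), (h2, m2) = p1, p2
    return cosine(compose(vectors[h1], vectors[m1]),
                  compose(vectors[h2], vectors[m2]))</p>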
      <p>Hereafter, the application of the DCS models to the SPTK kernel computation is
discussed.</p>
      <p>[Figure 3: the GRCT for “What instrument does Hendrix play?”, with syntactic nodes (root, dobj, det, aux, nsubj), pre-terminals (VB, NN, WDT, VBZ, NNP) and lexical leaves (play::v, instrument::n, what::w, do::v, Hendrix::n).]</p>
    </sec>
    <sec id="sec-4">
      <title>The Compositionally Smoothed Partial Tree Kernel</title>
      <p>When the compositionality of individual lexical constituents is made explicit at the
nodes, a compositional measure of similarity between entire parse trees can be
computed through a Tree Kernel. We define here such a similarity function as the
Compositionally Smoothed Partial Tree Kernel (CSPTK), by extending the SPTK formulation.
Let us consider the application of an SPTK to a tree pair represented in their CLCT
form: for example, the trees derived from sentences such as “Hendrix
plays guitar” and “Hendrix plays football”. Although the kernel recursively estimates
the similarity among all nodes through the function in Equation 1, it would not be
able to distinguish different senses of the verb to play, as the verb is mapped to a unique
distributional vector. The contribution of the lexical information (e.g. the vector for the
verb to play) must instead depend on the other words in the sentence.</p>
      <p>σ(nx, ny, lw): compositional estimation of the lexical contribution in the CSPTK (Algorithm 1).</p>
      <sec id="sec-4-1">
        <title>Algorithm 1</title>
        <p>σ ← 0
/* Matching between simple lexical nodes */
if nx = ⟨lexx::pos⟩ and ny = ⟨lexy::pos⟩ then
    σ ← σLEX(nx, ny)
end if
/* Matching between identical grammatical nodes, e.g. POS tags */
if (nx = pos or nx = dep) and nx = ny then
    σ ← lw
end if
if nx = ⟨dh,m, lix⟩ and ny = ⟨dh,m, liy⟩ then
    /* both modifiers are missing */
    if lix = ⟨hx::pos⟩ and liy = ⟨hy::pos⟩ then
        σ ← σComp((hx), (hy)) = σLEX(nx, ny)
    end if
    /* one modifier is missing */
    if lix = ⟨hx::posh⟩ and liy = ⟨hy::posh, my::posm⟩ then
        σ ← σComp((hx, hx), (hy, my))
    end if
    /* the general case */
    if lix = ⟨hx::posh, mx::posm⟩ and liy = ⟨hy::posh, my::posm⟩ then
        σ ← σComp((hx, mx), (hy, my))
    end if
end if
return σ</p>
        <p>The core novelty of the CSPTK is thus the new estimation of σ described in
Algorithm 1. Notice that for simple lexical nodes a lexical kernel σLEX, such as the
cosine similarity between vectors of words sharing the same POS tag, is applied.
Moreover, the other non-lexical nodes contribute according to a strict matching policy: they
provide full similarity only when the same POS, or dependency, is matched, and 0
otherwise. The novel part of Algorithm 1 corresponds to the treatment of compositional
nodes. The similarity function σComp between such nodes is computed only when they
exhibit the same dh,m and the respective heads and modifiers share the POS label: then,
a compositional metric is applied over the two involved (h, m) compounds. The
metric depends on the involved compounds, which can be only partially instantiated, so that
different strategies must be applied:
– General case. When fully instantiated compounds (h1, m1) and (h2, m2) are
available, the similarity function of Equation 3 is applied as usual. Notice that the
POS tags of heads and modifiers must be pairwise identical.
– Both modifiers are missing. When the modifiers mi are unexpressed in both trees, no
composition is observed: a lexical kernel σLEX, i.e. the cosine similarity between the
unique word vectors of h1 and h2, is applied.
– One modifier is missing. Let (h1, ∗) and (h2, m2) be the lexical information of
the two nodes n1 (which lacks the modifier) and n2. Here, the general similarity
function is applied to the pair (h1, h1) and the pair (h2, m2), so that no specific
lexical semantic constraint is applied to the head h1.</p>
        <p>As formally described in Algorithm 1, all other cases are discarded, e.g. when heads are
unknown or different POS tags are observed for the heads hi of the n1 and n2 nodes.
The factor lw is adopted here to reduce the contribution of non-lexical nodes.</p>
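        <p>A Python rendering of Algorithm 1 may clarify the case analysis. The node attributes
(kind, label, dep, head, mod, pos, lemma) are assumptions of this sketch, σLEX and σComp follow
the sketches given earlier, and the POS-agreement checks of the compositional cases are
compressed for brevity.</p>
        <p>import numpy as np

def _cos(u, v):
    return float(u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sigma_lex(l1, l2, vectors):
    u, v = vectors.get(l1), vectors.get(l2)
    return _cos(u, v) if u is not None and v is not None else float(l1 == l2)

def sigma_comp(h1, m1, h2, m2, vectors, compose=np.add):
    return _cos(compose(vectors[h1], vectors[m1]),
                compose(vectors[h2], vectors[m2]))

def sigma_csptk(nx, ny, vectors, lw=1.0):
    # Simple lexical nodes: cosine between word vectors sharing the same POS
    if nx.kind == 'lex' and ny.kind == 'lex' and nx.pos == ny.pos:
        return sigma_lex(nx.lemma, ny.lemma, vectors)
    # Grammatical nodes (POS or dependency labels): strict matching policy
    if nx.kind in ('pos', 'dep') and ny.kind == nx.kind and nx.label == ny.label:
        return lw
    # Compositional nodes sharing the same dependency function d_{h,m}
    if nx.kind == 'comp' and ny.kind == 'comp' and nx.dep == ny.dep:
        if nx.mod is None and ny.mod is None:   # both modifiers are missing
            return sigma_lex(nx.head, ny.head, vectors)
        if nx.mod is None:                      # one modifier is missing
            return sigma_comp(nx.head, nx.head, ny.head, ny.mod, vectors)
        if ny.mod is None:
            return sigma_comp(ny.head, ny.head, nx.head, nx.mod, vectors)
        # the general case
        return sigma_comp(nx.head, nx.mod, ny.head, ny.mod, vectors)
    return 0.0</p>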
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Evaluations</title>
      <p>
        In order to study the behavior of the proposed compositional kernels in the
automation of semantic inference over texts, we carried out experiments over two fine-grained
linguistic tasks, i.e. Question Classification (QC) and Semantic Text Similarity (STS).
In all experiments, texts are processed with the Stanford CoreNLP
(http://nlp.stanford.edu/software/corenlp.shtml) and the resulting
dependency trees are extended with compositional information as discussed in Section
3. The lexical similarity function σLEX exploited in the SPTK and CSPTK through
Algorithm 1 is based on a distributional analysis of the UkWaC corpus, which gives
rise to a word vector representation for all words occurring more than 100 times (i.e.
the targets), as discussed in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The vector dimensions correspond to co-occurrences
within a left or right window of 3 tokens between the targets and the set of the 20,000
most frequent words (i.e. features) in UkWaC; left contexts of the targets are treated
separately from the right ones, so that 40,000 features are derived for each target.
Scores correspond to point-wise Mutual Information values between each feature and
a target across the entire corpus. The Singular Value Decomposition is then applied to
the input matrix, with the space dimensionality cut at k = 250. The similarity σLEX
required by Algorithm 1 corresponds to
the cosine similarity in the resulting space, as in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
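      <p>The construction just described can be summarized in a short sketch: co-occurrence counts
within a ±3-token window, point-wise Mutual Information weighting, and an SVD cut at k dimensions.
For brevity the sketch merges left and right contexts, while the actual setup keeps them distinct
(yielding 40,000 features); all names are illustrative.</p>
      <p>import numpy as np

def build_word_space(sentences, targets, features, window=3, k=250):
    t_idx = {w: i for i, w in enumerate(targets)}
    f_idx = {w: j for j, w in enumerate(features)}
    counts = np.zeros((len(targets), len(features)))
    for sent in sentences:                      # sentences as token lists
        for i, w in enumerate(sent):
            if w not in t_idx:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):             # +/- 3-token context window
                if j != i and sent[j] in f_idx:
                    counts[t_idx[w], f_idx[sent[j]]] += 1
    total = counts.sum()
    pw = counts.sum(axis=1, keepdims=True) / total
    pf = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log((counts / total) / (pw * pf))   # point-wise Mutual Information
    pmi[~np.isfinite(pmi)] = 0.0
    U, S, _ = np.linalg.svd(pmi, full_matrices=False)
    return U[:, :k] * S[:k]                     # k-dimensional target vectors</p>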
      <sec id="sec-5-1">
        <title>Question Classification</title>
        <p>
          Question Classification (QC) is the task of assigning one (or more) class label(s) to a
given question written in natural language. Question classification plays an important
role in Question Answering (QA) [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ], as it has been shown that its performance
significantly influences the overall performance of a QA system, e.g. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Thanks to the
availability of our kernel similarity models, the QC task has been modeled directly in terms
of the parse trees representing questions, i.e. the classification objects. The reference
corpus is the UIUC dataset [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] (http://cogcomp.cs.illinois.edu/Data/QA/QC/), including 5,452 questions for training and 500
questions for testing, organized in six coarse-grained classes, i.e. ABBREVIATION, ENTITY,
DESCRIPTION, HUMAN, LOCATION and NUMBER. SVM training has been carried
out by applying (i) the PTK and SPTK kernels over the LCT and GRCT representations
and (ii) the compositional tree kernels (CSPTKs), according to different compositional
similarity metrics σLEX, to the CLCT and CGRCT formats. For learning our models,
we used the Kernel-based Learning Platform (http://sag.art.uniroma2.it/demo-software/kelp/),
which includes structural kernels, i.e.
STK and PTK: i) with the smooth match between tree leaves, i.e. the SPTK defined in
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], and ii) with the smooth match between compositionally labeled nodes of the tree,
as in the CSPTK. Different tree representations are denoted in the subscripts, so that
CSPTK_CLCT refers to the use of a CSPTK over the compositionally labeled LCT (as
in Fig. 2). As the compositional kernel has to make use of a compositional similarity
metric, the notation of different kernels is specialized according to the adopted
metric. Additive (with α = β), dilation (with λ = 1) and point-wise multiplication models
have been adopted, denoted by the +, d and ⊙ superscripts respectively.
        </p>
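        <p>Operationally, the training step can be sketched with any of the kernels above plugged in
as a precomputed Gram matrix; this is a generic scikit-learn recipe, not the actual KeLP
configuration used in the experiments.</p>
        <p>import numpy as np
from sklearn.svm import SVC

def train_qc(train_trees, labels, kernel):
    # Gram matrix over training trees; kernel is any tree kernel (PTK, SPTK, CSPTK)
    gram = np.array([[kernel(a, b) for b in train_trees] for a in train_trees])
    return SVC(kernel='precomputed').fit(gram, labels)

def classify_qc(clf, test_trees, train_trees, kernel):
    # Test-vs-train kernel evaluations feed the learned classifier
    gram = np.array([[kernel(t, a) for a in train_trees] for t in test_trees])
    return clf.predict(gram)</p>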
        <p>
          The accuracy achieved by the different systems is the percentage of sentences that
are correctly assigned to the proper question class, and it is reported in Table 1. The
parameters of the SVM have been estimated on a held-out subset sampled from the training
material. As a baseline, a simple Linear kernel (LIN) over a bag-of-words model
(denoted by BoW) is reported (row 1), representing questions as binary word vectors and
resulting in a lexical overlap kernel. A first observation is that the introduction of lexical
semantic information in tree kernel operators, such as in the SPTK vs. the PTK, is beneficial,
confirming the outcomes of previous studies [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Moreover, the proposed compositional kernels
(CSPTK) seem to make an effective use of the compositional distributional
semantic information, as they all outperform their non-compositional counterparts. Among
the compositional kernels, the ones adopting the simple sum operator, i.e. CSPTK+_CLCT,
seem to outperform all the other compositional operators. This remains true
independently of the type of tree adopted, i.e. LCT vs. GRCT. The Lexically Centered Tree
seems to better exploit compositional information, as it achieves, on average, higher
accuracy than the GRCT: it emphasizes lexical nodes, which occur closer to the tree
root and thus contribute mostly to the kernel recursive computation.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>The Semantic Text Similarity task</title>
        <p>
          The second experiment aims at evaluating the contribution of the different kernels in
estimating a semantic similarity between text pairs. A meaningful similarity function
between texts is crucial in many NLP and Retrieval tasks such as Textual Entailment,
Paraphrasing, Search Re-ranking, Machine Translation, Text Summarization or Deep
Question Answering. Kernel functions are evaluated against the dataset of the Semantic
Text Similarity (STS) task built in SemEval 2012 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], in particular the MSR Video
Paraphrase (MSRvid) dataset. This dataset is composed of 1,500 sentence pairs (split
into 750 training and 750 test pairs). Each sentence is a short description of a video, and the
pairs have been manually labeled through Amazon Mechanical Turk with a numerical
score (between 0 and 5) reflecting the similarity between the actions and actors involved
in the video. As an example, the sentences “A woman is slicing an onion” and “A woman
is cutting an onion” have a similarity of 4.2, while “A man is talking” and “A man is
walking” have a similarity of 1. The task consists in assigning a similarity score to each
test pair and the performance is measured in terms of the Pearson Correlation between
the obtained scores and the gold-standard ones, as reported in Table 2.
        </p>
        <p>
          The same kernels and syntactic structures as in the previous experiment have been
considered. The kernel functions have been applied to the STS task both in an
unsupervised and in a supervised fashion. In the first case, as shown in the top rows of
Table 2, each kernel has been evaluated by measuring the Pearson correlation between
the kernel score of each text pair and the human score. In the
supervised setting, shown in the bottom rows of Table 2, the kernel scores are used to
populate a feature vector used to train an SVM-based regressor [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], which determines
the best kernel combination approximating the human scores over the test material.
        </p>
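        <p>A compact sketch of this supervised setting, assuming each kernel is available as a scoring
function over a text pair; the regressor and the evaluation rely on standard scikit-learn and scipy
calls rather than on the exact setup of [30].</p>
        <p>import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import SVR

def kernel_features(pairs, kernels):
    # One feature per kernel score, e.g. [LIN_BoW, PTK, SPTK, CSPTK]
    return np.array([[k(t1, t2) for k in kernels] for (t1, t2) in pairs])

def run_sts(train_pairs, train_gold, test_pairs, test_gold, kernels):
    reg = SVR().fit(kernel_features(train_pairs, kernels), train_gold)
    predicted = reg.predict(kernel_features(test_pairs, kernels))
    return pearsonr(predicted, test_gold)[0]    # Pearson correlation vs. gold</p>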
      <p>The results achieved by the single kernels suggest that the semantic similarity in
the MSRvid dataset strongly depends on the lexical overlap between sentence pairs: a
regressor trained with a linear kernel over the bag-of-words representation (LinBoW)
achieves a correlation of 0.537. This result is higher than that of the unsupervised
application of the PTK, where only full syntactic information is used; this is mainly due
to the fact that most sentences are in the Subject-Verb-Object form, so that the
syntactic information is not discriminative. The simple smoothing over lexical nodes in the
SPTK is still not very informative, as smoothing over the pure lexical information
may be misleading. As an example, the sentence “A man plays a guitar” has a low
similarity (i.e. 2.0) w.r.t. “A boy plays a piano”, as they evoke different actions: while
generalizing is useful in comparing boy and man, it can be misleading in comparing
guitar and piano. Such smoothing is more controlled in the CSPTK, which achieves better
results. The above difference is mainly due to the increasing sensitivity of PTK, SPTK
and CSPTK to incrementally richer lexical information. This is especially evident
in sentence pairs with very similar syntactic structure. An example is the pair
“The man are playing soccer” and “A man is riding a motorcycle”, which are strictly
syntactically correlated. In fact, the PTK provides a similarity score of 0.647 between the
two sentences, as the difference between the tree structures is confined to the leaves. By
scoring 0.461, the SPTK introduces an improvement, as the distributional similarity that acts
as a smoothing factor between lexical nodes better discriminates uncorrelated words,
like motorcycle and soccer. However, ambiguous words, such as the verbs ride and play, are
still promoting a similarity that is locally misleading. Notice that both PTK and SPTK
receive a strong contribution in the recursive computation of the kernels from the left
branching of the tree, as the subject is the same, i.e. man. Compositional information
about the direct objects (soccer vs. motorcycle) is better propagated by the CSPTK dilation
operator. Its final score for the pair is 0.36, as the semantic differences between the
sentences are emphasized. Even if grammatical types strongly contribute to the final score
(as in the PTK or SPTK), the DCS computation over intermediate nodes (i.e. the VPs
(ride, motorcycle) and (play, soccer)) now faces less ambiguity, with correspondingly
lower scores. This improvement is even more noticeable in the supervised setting, where
the feature vector derived with the LinBoW kernel is first enriched with the PTK scores,
and then with the scores of the SPTK and CSPTK. The correlation gain of the latter kernel is
valuable with respect to the pure lexical overlap, resulting in the best result of 0.66.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>
        In this work, a model for compositionality in natural language within a kernel
formulation has been introduced. It accounts for structural (i.e. grammatical) and lexical
properties by adopting vector representations as models of distributional semantics
and extending previously proposed tree kernels. As an improvement over previous
extensions of tree kernels (e.g. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]), the smoothing proposed here applies in a
context-aware fashion throughout the parse tree, instead of focusing only on lexical nodes. The
Compositionally Smoothed Partial Tree Kernel thus corresponds to a Convolution
Kernel that measures the semantic similarity between complex linguistic structures, where
local matching at sub-trees is sensitive to head-modifier compositional properties. The
results obtained by applying the CSPTK across different NLP tasks, such as Question
Classification and Semantic Text Similarity, confirm the robustness and the generalization
capability of this kernel. Comparison against simpler lexical models or purely
grammatical kernels shows that the CSPTK is systematically the best model. The CSPTK can also be
easily applied to new domains (i.e. corpora and annotated datasets); in future work,
new compositional operators will be investigated, such as the Support Subspace [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Agirre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Diab</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Semeval-2012 task 6: A pilot on semantic textual similarity</article-title>
          .
          <source>In: Proceedings SemEval</source>
          <year>2012</year>
          . Stroudsburg, PA, USA (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Annesi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Storch</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basili</surname>
          </string-name>
          , R.:
          <article-title>Space projections as distributional models for semantic composition</article-title>
          .
          <source>In: In Proceedings of CICLing 2012</source>
          . vol.
          <volume>7181</volume>
          , pp.
          <fpage>323</fpage>
          -
          <lpage>335</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Annesi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croce</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basili</surname>
          </string-name>
          , R.:
          <article-title>Semantic compositionality in tree kernels</article-title>
          . In: Li,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.S.</given-names>
            ,
            <surname>Garofalakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.N.</given-names>
            ,
            <surname>Soboroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            ,
            <surname>Suel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Wang</surname>
          </string-name>
          , M. (eds.)
          <source>Proc. of CIKM</source>
          <year>2014</year>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Baroni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenci</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Distributional memory: A general framework for corpus-based semantics</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>36</volume>
          (
          <issue>4</issue>
          ),
          <fpage>673</fpage>
          -
          <lpage>721</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Baroni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zamparelli</surname>
          </string-name>
          , R.:
          <article-title>Nouns are vectors, adjectives are matrices: representing adjective-noun constructions in semantic space</article-title>
          .
          <source>In: Proceedings of the EMNLP</source>
          <year>2010</year>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Basile</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caputo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Semeraro</surname>
          </string-name>
          , G.:
          <article-title>Encoding syntactic dependencies by vector permutation</article-title>
          .
          <source>In: Proc. of the EMNLP 2011 Workshop on GEMS</source>
          . pp.
          <fpage>43</fpage>
          -
          <lpage>51</lpage>
          . Edinburgh, UK (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Coecke</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sadrzadeh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Mathematical foundations for a compositional distributional model of meaning</article-title>
          .
          <source>CoRR abs/1003</source>
          .4394 (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Collins,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Duffy</surname>
          </string-name>
          , N.:
          <article-title>Convolution kernels for natural language</article-title>
          .
          <source>In: Proceedings of Neural Information Processing Systems (NIPS)</source>
          . pp.
          <fpage>625</fpage>
          -
          <lpage>632</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Copestake</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flickinger</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pollard</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sag</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <source>Minimal Recursion Semantics: An Introduction. Research on Language &amp;amp; Computation</source>
          <volume>3</volume>
          (
          <issue>2-3</issue>
          ),
          <fpage>281</fpage>
          -
          <lpage>332</lpage>
          (
          <year>Dec 2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Croce</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moschitti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basili</surname>
          </string-name>
          , R.:
          <article-title>Structured lexical similarity via convolution kernels on dependency trees</article-title>
          .
          <source>In: Proceedings of EMNLP. Edinburgh</source>
          , Scotland, UK. (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Croce</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Annesi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Storch</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basili</surname>
          </string-name>
          , R.: Unitor:
          <article-title>Combining semantic text similarity functions through sv regression</article-title>
          .
          <source>In: Proceedings of SemEval 2012</source>
          . pp.
          <fpage>597</fpage>
          -
          <lpage>602</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Croce</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Previtali</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Manifold learning for the semi-supervised induction of framenet predicates: An empirical investigation</article-title>
          .
          <source>In: Proceedings of GEMS</source>
          <year>2010</year>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Erk</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pado</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>A structured vector space model for word meaning in context</article-title>
          .
          <source>In: Proceedings of EMNLP 2008</source>
          . pp.
          <fpage>897</fpage>
          -
          <lpage>906</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Foltz</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kintsch</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>The measurement of textual coherence with latent semantic analysis</article-title>
          .
          <source>Discourse Processes</source>
          <volume>25</volume>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Grefenstette</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sadrzadeh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Experimental support for a categorical compositional distributional model of meaning</article-title>
          .
          <source>CoRR abs/1106</source>
          .4058 (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Haussler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Convolution kernels on discrete structures</article-title>
          .
          <source>Tech. rep.</source>
          , Dept. of Computer Science, University of California at Santa Cruz (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerber</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hermjakob</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ravichandran</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Toward semantics-based answer pinpointing</article-title>
          .
          <source>In: Proc. of HLTR</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Zhang,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ,
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Q.</surname>
          </string-name>
          :
          <article-title>An approach based on tree kernels for opinion mining of online product reviews</article-title>
          .
          <source>In: Proceedings of ICDM 2010</source>
          . pp.
          <fpage>256</fpage>
          -
          <lpage>265</lpage>
          (
          <year>Dec 2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge</article-title>
          .
          <source>Psychological</source>
          review pp.
          <fpage>211</fpage>
          -
          <lpage>240</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Learning question classifiers</article-title>
          .
          <source>In: Proceedings of ACL '02</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . COLING '
          <volume>02</volume>
          ,
          <string-name>
            <surname>Stroudsburg</surname>
          </string-name>
          , PA, USA (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21. Mitchell, J.,
          <string-name>
            <surname>Lapata</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Vector-based models of semantic composition</article-title>
          .
          <source>In: In Proceedings of ACL/HLT 2008</source>
          . pp.
          <fpage>236</fpage>
          -
          <lpage>244</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Moschitti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quarteroni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basili</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manandhar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Exploiting syntactic and shallow semantic kernels for question answer classification</article-title>
          .
          <source>In: ACL</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Pado</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lapata</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Dependency-based construction of semantic space models</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>33</volume>
          (
          <issue>2</issue>
          ) (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Rudolph</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giesbrecht</surname>
          </string-name>
          , E.:
          <article-title>Compositional matrix-space models of language</article-title>
          .
          <source>In: Proceedings of ACL 2010</source>
          . pp.
          <fpage>907</fpage>
          -
          <lpage>916</lpage>
          . Stroudsburg, PA, USA (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Sahlgren</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The Word-Space Model</article-title>
          .
          <source>Ph.D. thesis</source>
          , Stockholm University (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26. Schütze, H.:
          <article-title>Automatic word sense discrimination</article-title>
          .
          <source>Comput. Linguist</source>
          .
          <volume>24</volume>
          (
          <issue>1</issue>
          ),
          <fpage>97</fpage>
          -
          <lpage>123</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Smolensky</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Tensor product variable binding and the representation of symbolic structures in connectionist systems</article-title>
          .
          <source>Artif. Intell</source>
          .
          <volume>46</volume>
          ,
          <fpage>159</fpage>
          -
          <lpage>216</lpage>
          (
          <year>November 1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Thater</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Fürstenau, H.,
          <string-name>
            <surname>Pinkal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Contextualizing semantic representations using syntactically enriched vector models</article-title>
          .
          <source>In: Proceedings of ACL 2010</source>
          . pp.
          <fpage>948</fpage>
          -
          <lpage>957</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Turney</surname>
            ,
            <given-names>P.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pantel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>From frequency to meaning: Vector space models of semantics</article-title>
          .
          <source>Journal of artificial intelligence research 37</source>
          ,
          <volume>141</volume>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.N.</given-names>
          </string-name>
          :
          <article-title>Statistical Learning Theory</article-title>
          . Wiley-Interscience (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E.M.:</given-names>
          </string-name>
          <article-title>Overview of the trec 2001 question answering track</article-title>
          .
          <source>In: TREC</source>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Zanzotto</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Korkontzelos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fallucchi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manandhar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Estimating linear models for compositional distributional semantics</article-title>
          .
          <source>In: Proceedings of COLING</source>
          <year>2010</year>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>