A Compositional Perspective in Convolution Kernels

Roberto Basili (1), Paolo Annesi (1), Giuseppe Castellucci (2), and Danilo Croce (1)
(1) Department of Enterprise Engineering, University of Roma Tor Vergata, Roma, Italy
{basili,annesi,croce}@info.uniroma2.it
(2) Department of Electronic Engineering, University of Roma Tor Vergata, Roma, Italy
castellucci@ing.uniroma2.it

Abstract. Kernel-based learning has been largely adopted in many semantic textual inference tasks. In particular, Tree Kernels (TKs) have been successfully applied to the modeling of syntactic similarity between linguistic instances in Question Answering or Information Extraction tasks. At the same time, lexical semantic information has been studied through the adoption of the so-called Distributional Semantics (DS) paradigm, where lexical vectors are acquired automatically from large-scale corpora. Recently, Compositional Semantics phenomena arising in complex linguistic structures have been studied in an extended paradigm called Distributional Compositional Semantics (DCS), where, for example, algebraic operators on lexical vectors have been defined to account for grammatically typed bi-grams or complex verb or noun phrases. In this paper, a novel kernel called Compositionally Smoothed Partial Tree Kernel is presented to integrate DCS operators into the tree kernel evaluation by also considering complex compositional nodes. Empirical results on well-known NLP tasks show that state-of-the-art performances can be achieved without resorting to manual feature engineering, thus suggesting that a large set of Web and text mining tasks can be handled successfully by this kernel.

1 Introduction

Convolution Kernels [16] are well-known similarity functions among complex syntactic and semantic structures. They are largely used to solve natural language related tasks, such as Opinion Mining [18] in recommender systems, Semantic Textual Similarity [11] in retrieval, or Question Classification [22] in question answering systems. In particular, Tree Kernels (TKs), introduced in [8], are used for their ability to capture syntactic information directly from syntactic parse trees. When training supervised learning algorithms, they are a valid approach to avoid the difficulty of manually designing effective features from linguistic structures. They define a similarity measure between data points (i.e. trees) in terms of all possible shared substructures.

In order to take into account the lexical generalization provided by the analysis of large-scale document collections, i.e. Distributional Semantic Models [19, 26, 25], a recent formulation of these kernel functions, called Smoothed Partial Tree Kernel (SPTK), has been introduced in [10]. The semantic information related to the lexical nodes of a parse tree is exploited for matching sub-structures characterized by different but (semantically) related lexical items: the similarity function between sub-trees thus also depends on a distributional metric between vectors representing the semantics of words. The main limitations of this approach are that a) lexical semantic information only relies on the vector metrics applied to the nodes in a context-free fashion and b) the semantic composition between words is neglected in the kernel computation, which only depends on their grammatical labels. In our perspective, semantic information is expressed by all nodes of the parse tree where specific compositions between heads and modifiers are shown.
For example, in the sentence "What instrument does Hendrix play?", the correct meaning of the verb play is fully captured only if its modifier, i.e. the noun instrument, is taken into account. This corresponds to modeling the contribution of the lexical item instrument as a function of the corresponding direct object relation (dobj) between play and instrument: we can say that instrument contributes to the sentence semantics by compositionally filling the slot dobj of its head play. In other words, lexical semantic information should propagate over the entire parse tree, making the compositionality phenomena of the sentence explicit. In recent years, Distributional Compositional Semantics (DCS) metrics have been proposed in order to combine lexical vector representations by algebraic operators defined in the corresponding distributional space [21, 13, 32, 4].

The purpose of this paper is to cast the use of words within the compositional relationships exposed by a tree in order to make the kernel sensitive to the underlying complex constituents. Even when the syntactic structure does not change, such as in "play an instrument" vs. "play a match", the contribution of the lexical information (e.g. the semantics of the verb "play") must be correspondingly different when estimating the kernel function against a phrase like "win a game". In traditional kernel computations, the similarity between play and win is context-free and the information of the direct object is neglected. The ideas underlying the Compositionally Smoothed Partial Tree Kernel (CSPTK), proposed in [3], are to i) use the SPTK formulation in order to exploit the lexical information encoded in a tree, ii) define a procedure to encode head-modifier information in the parse tree nodes in order to estimate the lexical similarity between sub-trees, and iii) apply compositional distributional metrics to enforce a context-dependent estimation of the similarity of individual head-modifier information at the nodes.

In Section 2 a summary of the approaches for DCS and TKs is presented. In Section 3, a lexical mark-up method for parse trees is described to support the CSPTK formulation, presented in Section 4. Finally, Section 5 reports the empirical evaluation of the CSPTK over Question Classification and Paraphrase Identification tasks.

2 Related Work

This section reports a summary of the approaches for Distributional Compositional Semantics and Convolution Kernels.

Distributional Compositional Semantics. Vector-based models typically represent isolated words and ignore grammatical structure [29] (a noticeable exception is [23]). They thus have a limited capability to model compositional operations over phrases and sentences. Distributional methods have recently been extended to better account for compositionality, in the so-called Distributional Compositional Semantics (DCS) approaches. Existing models are still controversial and provide both general algebraic operators over lexical vectors and more sophisticated composition methods. In [27] the compositionality of two vectors u and v is accounted for by the tensor product u ⊗ v, while in [14] lexical vectors are summed, keeping the resulting vector in the same dimension as the original ones. In [21] a semantic composition is seen as a function p = f(u, v, R, K) of two vectors u and v, their syntactic relation R and a factor K, defined as any additional knowledge or information which is needed to construct the semantics of their composition.
This perspective clearly leads to a variety of efficient yet shallow models of compositional semantics. Two simplified models are derived from these general forms: i) the additive model p_+ = αu + βv, where the weights α and β are applied to the lexical vectors; ii) the multiplicative model p_· = u ⊙ v, where the symbol ⊙ represents the point-wise multiplication, i.e. p_i = u_i · v_i. In [21] a more complex, asymmetric type of function called dilation, i.e. p_d = (u · u)v + (λ − 1)(u · v)u, is also introduced.
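As a concrete reference for these operators, the following is a minimal sketch in Python/numpy, illustrating the formulas above rather than the implementation used in the paper; the parameters alpha, beta and lam correspond to α, β and λ.

```python
import numpy as np

def additive(u, v, alpha=1.0, beta=1.0):
    # additive model: p+ = alpha*u + beta*v
    return alpha * u + beta * v

def multiplicative(u, v):
    # multiplicative model: point-wise product, p_i = u_i * v_i
    return u * v

def dilation(u, v, lam=2.0):
    # dilation model: p_d = (u.u) v + (lam - 1) (u.v) u
    return np.dot(u, u) * v + (lam - 1.0) * np.dot(u, v) * u
```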
In [13] the concept of a structured vector space is introduced, where each word is associated with a set of vectors corresponding to different syntactic dependencies. Every word is expressed by a tensor, and tensor operations are imposed. In [28] a similar geometric approach over a space of second-order distributional vectors is presented: vectors represent the words typically co-occurring with the contexts in which the target word appears. A parallel strand of research also seeks to represent the meaning of larger compositional structures using matrix and tensor algebra, as in [27] and [24]. In [5] and [7] composition is characterized, following formal semantics, in terms of function application, where the distributional representation of one element in a composition (the functor) is not a vector but a function. Moreover, a hybrid Logic-Distributional Model is presented in [15], whilst an approach based on vector permutation and the Random Indexing technique, i.e. [25], is presented in [6]. In [2] composition is expressed via a projection operation into a subspace, i.e. a subset of the original features of a syntactically motivated compound. A projection is a mapping (a selection function) over the set of features most representative of the compound: a subspace local to the (play, guitar) phrase can be found such that only the features specific to its meaning are selected. The resulting subspace of a compound should thus select the most appropriate word senses, neglecting the irrelevant ones and preserving the compositional semantics of the phrase.

Convolution Tree Kernels. Convolution Kernels, introduced in [16], determine a similarity function among discrete structures, such as sequences, strings or trees. They are representationally efficient ways to encode similarity metrics able to support supervised learning algorithms for complex textual inferences (e.g. semantic role classification). In particular, Tree Kernels (TKs) allow estimating the similarity among texts directly from the sentence syntactic structures, represented by trees, e.g. the dependency parse trees. A TK computes the number of substructures (as well as their partial fragments) shared by two parse trees T1 and T2. For this purpose, let the set $F = \{f_1, f_2, \dots, f_{|F|}\}$ be a space of tree fragments and $\chi_i(n)$ be an indicator function: it is 1 if the target $f_i$ is rooted at node $n$ and 0 otherwise. A tree kernel is a function $TK(T_1, T_2) = \sum_{n_1 \in N_{T_1}} \sum_{n_2 \in N_{T_2}} \Delta(n_1, n_2)$, where $N_{T_1}$ and $N_{T_2}$ are the sets of $T_1$'s and $T_2$'s nodes and $\Delta(n_1, n_2) = \sum_{i=1}^{|F|} \chi_i(n_1)\chi_i(n_2)$. The $\Delta$ function recursively computes the amount of similarity derived from the number of common substructures. The type of considered fragments determines the expressiveness of the kernel space, and different tree kernels are characterized by different choices.

In particular, the Smoothed Partial Tree Kernels, as discussed in [10], measure the similarity between syntactic tree structures which are semantically related, i.e. partially similar, even when nodes, e.g. the words at the leaves, differ. This is achieved by the following formulation of the function $\Delta$ over nodes $n_i \in T_i$:

$\Delta_\sigma(n_1, n_2) = \mu\lambda\,\sigma(n_1, n_2)$, if $n_1$ and $n_2$ are leaves, else

$\Delta_\sigma(n_1, n_2) = \mu\,\sigma(n_1, n_2)\Big(\lambda^2 + \sum_{\vec{I}_1,\vec{I}_2,\, l(\vec{I}_1)=l(\vec{I}_2)} \lambda^{d(\vec{I}_1)+d(\vec{I}_2)} \prod_{j=1}^{l(\vec{I}_1)} \Delta_\sigma(c_{n_1}(\vec{I}_{1j}), c_{n_2}(\vec{I}_{2j}))\Big)$   (1)

Here, $\vec{I}_{1j}$ represents the sequence of subtrees, dominated by node $n_1$, that are shared with the children of $n_2$ (i.e. $\vec{I}_{2j}$): all other non-matching substructures are neglected. Two decay factors, $\mu$ and $\lambda$, are introduced, respectively for the height of the tree and for the length of the child sequences. The novelty of the SPTK is represented by the introduction of a similarity function $\sigma$ between nodes, which are typed according to syntactic categories, POS tags or lexical entries. A boolean measure of similarity is applied between non-lexical nodes, assigning 1 to identical syntactic or POS nodes and 0 otherwise. For lexical nodes, the cosine similarity is applied between words sharing the same POS tag. One main limitation of the SPTK is that lexical similarity does not consider the compositional interaction between words. Given the phrase pairs (np (nn river)(nn bank)) and (np (nn savings)(nn bank)), the SPTK would estimate the similarity by relying on a unique meaning for bank, which is wrong whenever one considers the compositional role of the modifiers, river vs. savings.
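To make the node similarity σ just described concrete, the following is a minimal sketch (an illustration under our own assumptions, not the authors' code): node labels are assumed to be plain strings such as 'play::v' for lexical nodes or 'dobj'/'VBZ' for grammatical ones, and `vectors` maps lexical labels to their distributional vectors.

```python
import numpy as np

def cosine(u, v):
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / norm) if norm > 0 else 0.0

def sigma(n1, n2, vectors):
    """Node similarity of the SPTK: strict matching on grammatical nodes,
    cosine similarity between lexical nodes sharing the same POS tag."""
    lex1, lex2 = '::' in n1, '::' in n2
    if not lex1 and not lex2:
        # non-lexical nodes: 1 for identical syntactic/POS labels, 0 otherwise
        return 1.0 if n1 == n2 else 0.0
    if lex1 and lex2 and n1.split('::')[1] == n2.split('::')[1]:
        # lexical nodes with the same POS: cosine of their distributional vectors
        if n1 in vectors and n2 in vectors:
            return cosine(vectors[n1], vectors[n2])
    return 0.0
```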
3 Making lexical composition explicit in parse trees

In this Section, a way of representing compositional information is discussed as a labeling of individual nodes in the tree. The result is an enriched representation, on which Distributional Compositional operators can be applied while evaluating the SPTK function, i.e. resulting in a Compositionally Smoothed Partial Tree Kernel (CSPTK). Compositional semantic constraints over a tree kernel computation can be applied only when the individual syntagms corresponding to nodes are made explicit. The CSPTK takes as input two trees whose nodes express a lexical composition: this is used in the recursive compositional matching foreseen by the underlying convolution model, i.e. the same as in the SPTK. Given the question "What instrument does Hendrix play?" and its dependency structure, the corresponding syntactic structure is shown in Figure 1 in terms of a Lexically Centered Tree (LCT), as proposed in [10].

Nodes in LCTs can be partitioned into: lexical nodes, i.e. nodes n ∈ NT, representing one lexical item in terms of ⟨l_n::pos_n⟩, such as instrument::n or play::v, where l is the lemma of the token and pos its part-of-speech (general POS tags are obtained from the PennTreebank standard by truncating at their first character, as in instrument::n); terminal nodes, n ∈ T, that are the children of lexical nodes and encode a dependency function d ∈ D (e.g. nsubj) or the POS tag of the parent node (e.g. NN). An additional tree can be derived by emphasizing the importance of syntactic information, the so-called Grammatical Relation Centered Tree (GRCT) in Figure 3, where nodes are partitioned into: syntactic nodes, i.e. nodes n ∈ NT, that encode dependency functions d ∈ D (e.g. nsubj); pre-terminal nodes, n ∈ PT, that encode the POS tag of each lexical item assigned to a leaf; and lexical nodes, i.e. nodes n ∈ T, representing one lexical item in terms of ⟨l_n::pos_n⟩. We aim at introducing lexical compositionality in these syntactic representations, through the explicit mark-up of every compositional (sub)structure.

Notice how grammatical dependencies in an LCT representation are encoded as direct descendants of a modifier non-terminal. For example, the dependency dobj is a direct descendant of the lexical node instrument::n and expresses its grammatical role in the relationship with its parent node play::v. We propose to mark up such lexical nodes with the full description of the corresponding head/modifier relationship, denoted by (h, m). In order to emphasize the semantic contribution of this relationship, the lexical information about the involved head (l_h) and modifier (l_m) must be represented: we denote it through a 4-tuple ⟨l_h::pos_h, l_m::pos_m⟩. Therefore, in an extended (compositional) representation for LCTs, every non-terminal n ∈ NT is marked as

⟨d_{h,m}, ⟨l_h::pos_h, l_m::pos_m⟩⟩   (2)

where d_{h,m} ∈ D is the dependency function between h and m, and the l_i and pos_i are lexical entries and POS tags. Moreover, one lexical node is also added to represent the simple lexical information of the involved modifier, in order to limit data sparseness. Notice that this mark-up closely resembles the representation of immediate dominance rules for grammatical phrases in Attribute Value Matrices (AVMs) expressing feature structures in HPSG-like formalisms [9].

Figure 2 shows a fully compositionally labeled tree for the sentence whose unlabeled version has been shown in Figure 1. Now, non-terminal nodes represent compositional lexical pairs of type head/modifier marked as in Eq. 2. For example, the node dobj⟨play::v, instrument::n⟩ expresses the direct object relation between the verb to play and the object instrument, as shown in Fig. 2. Dependency functions (dobj) and POS tags (VBZ) are encoded in terminal nodes as in the LCT before. Eventually, simple lexical nodes, e.g. play::v, are added as terminal nodes.

Fig. 1. Lexically centered tree (LCT) of the sentence "What instrument does Hendrix play?"
Fig. 2. Compositional Lexically Centered Tree (CLCT) of "What instrument does Hendrix play?"

In a similar fashion, the GRCT in Figure 3 can be extended into its compositional counterpart, i.e. the CGRCT in Figure 4. In this case, non-terminal nodes represent compositional lexical pairs of type head/modifier marked as in Eq. 2; POS tags are encoded in pre-terminal nodes as in the GRCT before; finally, simple lexical nodes, e.g. play::v, are preserved as terminal nodes.

Fig. 3. Grammatical Relation Centered Tree (GRCT) of "What instrument does Hendrix play?"
Fig. 4. Compositional Grammatical Relation Centered Tree (CGRCT) of "What instrument does Hendrix play?"

Every compositional node supports the application of a compositional distributional semantic model, where the lexical entries for heads (l_h::pos_h) and modifiers (l_m::pos_m) correspond to unique vectors. Given two subtrees T1 and T2, rooted at nodes n1 and n2, and the corresponding head-modifier pairs p1 = (h1, m1) and p2 = (h2, m2), a shallow compositional function, independent from any dependency relation d_{h,m}, is defined over non-terminal nodes by adopting one of the DCS models discussed in Section 2, so that:

σ_Comp((h1, m1), (h2, m2)) = Φ∘((h1, m1), (h2, m2))   (3)

Hereafter, the application of the DCS models to the SPTK kernel computation is discussed.
4 The Compositionally Smoothed Partial Tree Kernel

When the compositionality of individual lexical constituents is made explicit at the nodes, a compositional measure of similarity between entire parse trees can be computed through a Tree Kernel. We define here such a similarity function as the Compositionally Smoothed Partial Tree Kernel (CSPTK), by extending the SPTK formulation. Let us consider the application of an SPTK to a tree pair represented in their CLCT form: for example, let us consider the trees derived from sentences such as "Hendrix plays guitar" and "Hendrix plays football". Although the kernel recursively estimates the similarity among all nodes through the σ function in Equation 1, it would not be able to distinguish different senses of the verb to play, as it is applied to a unique distributional vector. The contribution of the lexical information (e.g. the vector for the verb to play) must be correspondingly dependent on the other words in the sentence.

The core novelty of the CSPTK is thus a new estimation of σ_τ, as described in Algorithm 1.

Algorithm 1: σ_τ(n_x, n_y, lw) — compositional estimation of the lexical contribution in the CSPTK
  σ_τ ← 0
  /* matching between simple lexical nodes */
  if n_x = ⟨lex_x::pos⟩ and n_y = ⟨lex_y::pos⟩ then
    σ_τ ← σ_LEX(n_x, n_y)
  end if
  /* matching between identical grammatical nodes, e.g. POS tags */
  if (n_x = pos or n_x = dep) and n_x = n_y then
    σ_τ ← lw
  end if
  if n_x = ⟨d_{h,m}, ⟨li_x⟩⟩ and n_y = ⟨d_{h,m}, ⟨li_y⟩⟩ then
    /* both modifiers are missing */
    if li_x = ⟨h_x::pos⟩ and li_y = ⟨h_y::pos⟩ then
      σ_τ ← σ_Comp((h_x), (h_y)) = σ_LEX(n_x, n_y)
    end if
    /* one modifier is missing */
    if li_x = ⟨h_x::pos_h⟩ and li_y = ⟨h_y::pos_h, m_y::pos_m⟩ then
      σ_τ ← σ_Comp((h_x, h_x), (h_y, m_y))
    end if
    /* the general case */
    if li_x = ⟨h_x::pos_h, m_x::pos_m⟩ and li_y = ⟨h_y::pos_h, m_y::pos_m⟩ then
      σ_τ ← σ_Comp((h_x, m_x), (h_y, m_y))
    end if
  end if
  return σ_τ

Notice that for simple lexical nodes, a lexical kernel σ_LEX, such as the cosine similarity between the vectors of words sharing the same POS tag, is applied. Moreover, the other non-lexical nodes contribute according to a strict matching policy: they provide full similarity only when the same POS, or dependency, is matched and 0 otherwise. The novel part of Algorithm 1 corresponds to the treatment of compositional nodes. The similarity function σ_Comp between such nodes is computed only when they exhibit the same d_{h,m} and the respective heads and modifiers share the POS label: then, a compositional metric is applied over the two involved (h, m) compounds. The metric depends on the involved compounds, which can be only partially instantiated, so that different strategies must be applied:

– General case. When fully instantiated compounds (h1, m1) and (h2, m2) are available, the similarity function of Equation 3 is applied as usual. Notice that the POS tags of heads and modifiers must be pairwise identical.
– Both modifiers are missing. When the modifiers m_i are unexpressed in both trees, no composition is observed: a lexical kernel σ_LEX, i.e. the cosine similarity between the unique word vectors h1 and h2, is applied.
– One modifier is missing. Let (h1, ∗) and (h2, m2) be the lexical information of the two nodes n1 (which lacks the modifier) and n2. Here, the general similarity function is applied to the pair (h1, h1) and the pair (h2, m2), so that no specific lexical semantic constraint is applied to the head h1.

As formally described in Algorithm 1, all other cases are discarded, e.g. when heads are unknown or different POS tags are observed for the heads h_i of the nodes n1 and n2. The factor lw is adopted here to reduce the contribution of non-lexical nodes; a compact rendering of this procedure is sketched below.
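The following Python sketch of Algorithm 1 is only illustrative and rests on our own representation choices: compositional nodes are assumed to be triples (dep, head, modifier) with head/modifier given as 'lemma::pos' strings (modifier possibly None), lexical nodes are plain 'lemma::pos' strings, grammatical nodes are bare POS or dependency labels, and the POS-agreement checks are assumed to be enforced inside sigma_lex and sigma_comp.

```python
def sigma_tau(nx, ny, lw, sigma_lex, sigma_comp):
    """Sketch of Algorithm 1 under the node representation assumed above."""
    comp_x, comp_y = isinstance(nx, tuple), isinstance(ny, tuple)
    if not comp_x and not comp_y:
        if '::' in nx and '::' in ny:            # simple lexical nodes
            return sigma_lex(nx, ny)
        return lw if nx == ny else 0.0           # identical POS/dependency labels
    if comp_x and comp_y and nx[0] == ny[0]:     # same dependency function d(h,m)
        (hx, mx), (hy, my) = nx[1:], ny[1:]
        if mx is None and my is None:            # both modifiers missing
            return sigma_lex(hx, hy)
        if mx is None:                           # one modifier missing
            return sigma_comp((hx, hx), (hy, my))
        if my is None:                           # symmetric case, handled analogously
            return sigma_comp((hy, hy), (hx, mx))
        return sigma_comp((hx, mx), (hy, my))    # the general case
    return 0.0
```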
5 Experimental Evaluations

In order to study the behavior of the proposed compositional kernels in the automation of semantic inference over texts, we carried out experiments over two fine-grained linguistic tasks, i.e. Question Classification (QC) and Semantic Text Similarity (STS). In all experiments, texts are processed with the Stanford CoreNLP suite (http://nlp.stanford.edu/software/corenlp.shtml) and the resulting dependency trees are extended with compositional information as discussed in Section 3. The lexical similarity function σ_LEX exploited in the SPTK and CSPTK through Algorithm 1 is based on a distributional analysis of the UkWaC corpus, which gives rise to a word vector representation for all words occurring more than 100 times (i.e. the targets), as discussed in [12]. The vector dimensions correspond to co-occurrences within a left or right window of 3 tokens between the targets and the set of the 20,000 most frequent words (i.e. the features) in UkWaC; left contexts of the targets are treated separately from the right ones, so that 40,000 features are derived for each target. Scores correspond to point-wise Mutual Information values between each feature and a target across the entire corpus. The Singular Value Decomposition is then applied to the input matrix, with the space dimensionality cut at k = 250. The similarity σ_LEX required by Algorithm 1 corresponds to the cosine similarity in the resulting space, as in [10].
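The construction of such a word space can be sketched as follows. This is only an illustration under simplifying assumptions (a dense word-by-context count matrix is assumed to be already available, and the function names are ours), not the exact pipeline of [12].

```python
import numpy as np

def pmi(counts):
    # point-wise Mutual Information weighting of a word-by-context count matrix
    total = counts.sum()
    p_word = counts.sum(axis=1, keepdims=True) / total
    p_ctx = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide='ignore', invalid='ignore'):
        weights = np.log((counts / total) / (p_word * p_ctx))
    weights[~np.isfinite(weights)] = 0.0   # zero counts yield -inf/NaN: set to 0
    return weights

def word_space(counts, k=250):
    # truncated SVD of the PMI matrix, keeping k = 250 latent dimensions
    u, s, _ = np.linalg.svd(pmi(counts), full_matrices=False)
    return u[:, :k] * s[:k]                # one k-dimensional vector per target word
```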
5.1 Question Classification

Question Classification (QC) is the task of assigning one (or more) class label(s) to a given question written in natural language. Question classification plays an important role in Question Answering (QA) [31], as it has been shown that its performance significantly influences the overall performance of a QA system, e.g. [17]. Thanks to the availability of our kernel similarity models, the QC task has been modeled directly in terms of the parse trees representing the questions, i.e. the classification objects. The reference corpus is the UIUC dataset [20] (http://cogcomp.cs.illinois.edu/Data/QA/QC/), including 5,452 questions for training and 500 questions for test, organized in six coarse-grained classes, i.e. ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION and NUMBER. SVM training has been carried out by applying (i) the PTK and SPTK kernels over the LCT and GRCT representations and (ii) the compositional tree kernels (CSPTKs), according to different compositional similarity metrics, to the CLCT and CGRCT formats. For learning our models, we used the Kernel-based Learning Platform (http://sag.art.uniroma2.it/demo-software/kelp/), which includes structural kernels, i.e. STK and PTK: i) with the smoothed match between tree leaves, i.e. the SPTK defined in [10], and ii) with the smoothed match between compositionally labeled nodes of the tree, as in the CSPTK.

Different tree representations are denoted in the subscripts, so that CSPTK_CLCT refers to the use of a CSPTK over the compositionally labeled LCT (as in Fig. 2). As the compositional kernel has to make use of a compositional similarity metric, the notation of the different kernels is specialized according to the adopted metric: the additive (with α = β), dilation (with λ = 1) and point-wise multiplication models have been adopted and are denoted by the +, d and · superscripts, respectively.

The accuracy achieved by the different systems is the percentage of sentences that are correctly assigned to the proper question class, and it is reported in Table 1. The parameter of the SVM has been estimated on a held-out subset sampled from the training material. As a baseline, a simple linear kernel (LIN) over a bag-of-words model (denoted by BoW) is reported (row 1), representing questions as binary word vectors and resulting in a lexical overlap kernel. A first observation is that the introduction of lexical semantic information in tree kernel operators, such as in SPTK vs. PTK, is beneficial, confirming the outcomes of previous studies [10]. Moreover, the proposed compositional kernels (CSPTK) seem to make an effective use of the compositional distributional semantic information, as they all outperform their non-compositional counterparts. Among the compositional kernels, the ones adopting the simple sum operator, i.e. CSPTK+_CLCT, seem to outperform all the other compositional operators. This remains true independently of the type of tree adopted, i.e. LCT vs. GRCT. The Lexically Centered Tree seems to better exploit compositional information, as it achieves, on average, higher accuracy than the GRCT, given that it emphasizes lexical nodes that occur closer to the tree root and thus contribute mostly to the kernel recursive computation.

Table 1. Results in the QC task

  Kernel             Accuracy
  LIN_BoW            86.8%
  PTK_LCT            90.4%
  PTK_GRCT           89.8%
  SPTK_LCT           94.0%
  SPTK_GRCT          91.4%
  CSPTK+_CLCT        94.8%
  CSPTK+_CGRCT       94.0%
  CSPTK·_CLCT        93.6%
  CSPTK·_CGRCT       93.6%
  CSPTKd_CLCT        93.6%
  CSPTKd_CGRCT       92.8%

Table 2. Results in the STS task: unsupervised and supervised settings (Pearson correlation)

  Unsupervised
  Kernel                                 Pearson
  PTK_LCT                                0.291
  PTK_GRCT                               0.118
  SPTK_LCT                               0.270
  SPTK_GRCT                              0.102
  CSPTK+_CLCT                            0.370
  CSPTK+_CGRCT                           0.400
  CSPTK·_CLCT                            0.316
  CSPTK·_CGRCT                           0.341
  CSPTKd_CLCT                            0.432
  CSPTKd_CGRCT                           0.454
  LIN_BoW                                0.537

  Regression
  Kernel combination                     Pearson
  LIN_BoW + PTK_LCT                      0.578
  LIN_BoW + PTK_GRCT                     0.630
  LIN_BoW + PTK_LCT + SPTK_LCT           0.577
  LIN_BoW + PTK_GRCT + SPTK_LCT          0.636
  LIN_BoW + PTK_LCT + CSPTKd_CLCT        0.590
  LIN_BoW + PTK_GRCT + CSPTKd_CGRCT      0.660

5.2 The Semantic Text Similarity task

The second experiment aims at evaluating the contribution of the different kernels in estimating a semantic similarity between text pairs. A meaningful similarity function between texts is crucial in many NLP and Retrieval tasks such as Textual Entailment, Paraphrasing, Search Re-ranking, Machine Translation, Text Summarization or Deep Question Answering. The kernel functions are evaluated against the dataset of the Semantic Text Similarity (STS) task built in SemEval 2012 [1], in particular the MSR Video Paraphrase (MSRvid) dataset. This dataset is composed of 1,500 sentence pairs (split into 750 train and 750 test pairs). Each sentence is a short description of a video, and the pairs have been manually labeled through Amazon Mechanical Turk with a numerical score (between 0 and 5) reflecting the similarity between the actions and actors involved in the video. As an example, the sentences "A woman is slicing an onion" and "A woman is cutting an onion" have a similarity of 4.2, while "A man is talking" and "A man is walking" have a similarity of 1. The task consists in assigning a similarity score to each test pair, and the performance is measured in terms of the Pearson correlation between the obtained scores and the gold-standard ones, as reported in Table 2.

The same kernels and syntactic structures of the previous experiments have been considered. The kernel functions have been applied to the STS task both in an unsupervised and in a supervised fashion. In the first case, as shown in the top rows of Table 2, each kernel has been evaluated by measuring the Pearson correlation between the result of the kernel function over each text pair and the human score. In the supervised setting, shown in the bottom rows of Table 2, the kernel scores are used to populate a feature vector used to train an SVM-based regressor [30], which determines the best kernel combination approximating the human scores over the test material.
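In the unsupervised setting, a natural way to turn a kernel into a similarity score is to normalize it and correlate the resulting scores with the gold standard; the normalization step is a standard practice we assume here, not a detail spelled out in the paper, and the function names are illustrative.

```python
import math
from scipy.stats import pearsonr

def normalized_kernel(K, t1, t2):
    # cosine-style normalization of a tree kernel K over the two parse trees of a pair
    denom = math.sqrt(K(t1, t1) * K(t2, t2))
    return K(t1, t2) / denom if denom > 0 else 0.0

def unsupervised_sts(K, pairs, gold_scores):
    # pairs: list of (tree1, tree2); gold_scores: human judgments in [0, 5]
    predictions = [normalized_kernel(K, t1, t2) for t1, t2 in pairs]
    correlation, _ = pearsonr(predictions, gold_scores)
    return correlation
```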
The results achieved by the single kernels suggest that the semantic similarity in the MSRvid dataset strongly depends on the lexical overlap between sentence pairs: a regressor trained with a linear kernel over the bag-of-words representations (LIN_BoW) achieves a correlation of 0.537. This result is higher than the unsupervised application of the PTK, where only full syntactic information is used; this is mainly due to the fact that most sentences are in the Subject-Verb-Object form, so that syntactic information is not discriminative. The simple smoothing over lexical nodes in the SPTK is still not very informative, as smoothing over the pure lexical information may be misleading. As an example, the sentence pair "A man plays a guitar" and "A boy plays a piano" has a low similarity (i.e. 2.0), as the sentences evoke different actions: while generalizing is useful in comparing boy and man, it can be misleading in comparing guitar and piano. Such smoothing is more controlled in the CSPTK, which achieves better results.

The above difference is mainly due to the increasing sensitivity of PTK, SPTK and CSPTK to incrementally richer lexical information. This is especially evident in sentence pairs with very similar syntactic structure. For example, one such pair is given by "The man are playing soccer" and "A man is riding a motorcycle", which are strictly syntactically correlated. In fact, the PTK provides a similarity score of 0.647 between the two sentences, as the differences between the tree structures are confined to the leaves. By scoring 0.461, the SPTK introduces an improvement, as the distributional similarity acting as a smoothing factor between lexical nodes better discriminates uncorrelated words, like motorcycle and soccer. However, ambiguous words such as the verbs ride and play still promote a similarity that is locally misleading. Notice that both the PTK and the SPTK receive a strong contribution in the recursive computation of the kernels from the left branching of the tree, as the subject is the same, i.e. man. Compositional information about the direct objects (soccer vs. motorcycle) is better propagated by the CSPTK dilation operator. Its final score for the pair is 0.36, as the semantic differences between the sentences are emphasized. Even if grammatical types strongly contribute to the final score (as in the PTK or SPTK), the DCS computation over the intermediate nodes (i.e. the verb phrases (ride, motorcycle) and (play, soccer)) now faces less ambiguity, with correspondingly lower scores.
This improvement is even more noticeable in the supervised setting, where the feature vector derived with the LIN_BoW kernel is first enriched with the PTK scores, and then with the scores of the SPTK and CSPTK. The contribution of the latter kernel is valuable with respect to the pure lexical overlap, resulting in the best result of 0.66.

6 Conclusions

In this work, a model for compositionality in natural language within a kernel formulation has been introduced. It accounts for structural (i.e. grammatical) and lexical properties by adopting vector representations as models for distributional semantics and extending previously proposed tree kernels. As an improvement over previous extensions of tree kernels (e.g. [10]), the smoothing proposed here applies in a context-aware fashion throughout the parse tree, rather than focusing only on lexical nodes. The Compositionally Smoothed Partial Tree Kernel thus corresponds to a Convolution Kernel that measures the semantic similarity between complex linguistic structures, where local matching at sub-trees is sensitive to head-modifier compositional properties. The results obtained by applying the CSPTK across different NLP tasks, such as Question Classification and Semantic Text Similarity, confirm the robustness and the generalization capability of this kernel. The comparison against simpler lexical models or purely grammatical kernels shows that the CSPTK is systematically the best model. The CSPTK can also be easily applied to new domains (i.e. corpora and annotated datasets); moreover, new compositional operators will be investigated, such as the Support Subspace [2].

References

1. Agirre, E., Diab, M., Cer, D., Gonzalez-Agirre, A.: SemEval-2012 Task 6: A pilot on semantic textual similarity. In: Proceedings of SemEval 2012. Stroudsburg, PA, USA (2012)
2. Annesi, P., Storch, V., Basili, R.: Space projections as distributional models for semantic composition. In: Proceedings of CICLing 2012. vol. 7181, pp. 323–335 (2012)
3. Annesi, P., Croce, D., Basili, R.: Semantic compositionality in tree kernels. In: Li, J., Wang, X.S., Garofalakis, M.N., Soboroff, I., Suel, T., Wang, M. (eds.) Proc. of CIKM 2014 (2014)
4. Baroni, M., Lenci, A.: Distributional memory: A general framework for corpus-based semantics. Computational Linguistics 36(4), 673–721 (2010)
5. Baroni, M., Zamparelli, R.: Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In: Proceedings of EMNLP 2010 (2010)
6. Basile, P., Caputo, A., Semeraro, G.: Encoding syntactic dependencies by vector permutation. In: Proc. of the EMNLP 2011 Workshop on GEMS. pp. 43–51. Edinburgh, UK (2011)
7. Coecke, B., Sadrzadeh, M., Clark, S.: Mathematical foundations for a compositional distributional model of meaning. CoRR abs/1003.4394 (2010)
8. Collins, M., Duffy, N.: Convolution kernels for natural language. In: Proceedings of Neural Information Processing Systems (NIPS). pp. 625–632 (2001)
9. Copestake, A., Flickinger, D., Pollard, C., Sag, I.: Minimal Recursion Semantics: An introduction. Research on Language & Computation 3(2-3), 281–332 (Dec 2005)
10. Croce, D., Moschitti, A., Basili, R.: Structured lexical similarity via convolution kernels on dependency trees. In: Proceedings of EMNLP. Edinburgh, Scotland, UK (2011)
11. Croce, D., Annesi, P., Storch, V., Basili, R.: UNITOR: Combining semantic text similarity functions through SV regression. In: Proceedings of SemEval 2012. pp. 597–602 (2012)
12. Croce, D., Previtali, D.: Manifold learning for the semi-supervised induction of FrameNet predicates: An empirical investigation. In: Proceedings of GEMS 2010 (2010)
13. Erk, K., Padó, S.: A structured vector space model for word meaning in context. In: Proceedings of EMNLP 2008. pp. 897–906 (2008)
14. Foltz, P., Kintsch, W., Landauer, T.: The measurement of textual coherence with latent semantic analysis. Discourse Processes 25 (1998)
15. Grefenstette, E., Sadrzadeh, M.: Experimental support for a categorical compositional distributional model of meaning. CoRR abs/1106.4058 (2011)
16. Haussler, D.: Convolution kernels on discrete structures. Tech. rep., Dept. of Computer Science, University of California at Santa Cruz (1999)
17. Hovy, E., Gerber, L., Hermjakob, U., Lin, C.Y., Ravichandran, D.: Toward semantics-based answer pinpointing. In: Proc. of HLTR. pp. 1–7 (2001)
18. Jiang, P., Zhang, C., Fu, H., Niu, Z., Yang, Q.: An approach based on tree kernels for opinion mining of online product reviews. In: Proceedings of ICDM 2010. pp. 256–265 (Dec 2010)
19. Landauer, T., Dumais, S.: A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review pp. 211–240 (1997)
20. Li, X., Roth, D.: Learning question classifiers. In: Proceedings of COLING '02. pp. 1–7. Stroudsburg, PA, USA (2002)
21. Mitchell, J., Lapata, M.: Vector-based models of semantic composition. In: Proceedings of ACL/HLT 2008. pp. 236–244 (2008)
22. Moschitti, A., Quarteroni, S., Basili, R., Manandhar, S.: Exploiting syntactic and shallow semantic kernels for question answer classification. In: ACL (2007)
23. Padó, S., Lapata, M.: Dependency-based construction of semantic space models. Computational Linguistics 33(2) (2007)
24. Rudolph, S., Giesbrecht, E.: Compositional matrix-space models of language. In: Proceedings of ACL 2010. pp. 907–916. Stroudsburg, PA, USA (2010)
25. Sahlgren, M.: The Word-Space Model. Ph.D. thesis, Stockholm University (2006)
26. Schütze, H.: Automatic word sense discrimination. Computational Linguistics 24(1), 97–123 (1998)
27. Smolensky, P.: Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence 46, 159–216 (November 1990)
28. Thater, S., Fürstenau, H., Pinkal, M.: Contextualizing semantic representations using syntactically enriched vector models. In: Proceedings of ACL 2010. pp. 948–957 (2010)
29. Turney, P.D., Pantel, P.: From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37, 141 (2010)
30. Vapnik, V.N.: Statistical Learning Theory. Wiley-Interscience (1998)
31. Voorhees, E.M.: Overview of the TREC 2001 question answering track. In: TREC (2001)
32. Zanzotto, F.M., Korkontzelos, I., Fallucchi, F., Manandhar, S.: Estimating linear models for compositional distributional semantics. In: Proceedings of COLING 2010 (2010)