Algebraic compositional models for semantic similarity in ranking and clustering

Paolo Annesi, Valerio Storch, Danilo Croce and Roberto Basili
Dept. of Computer Science, University of Roma Tor Vergata, Roma, Italy
{annesi,croce,basili}@info.uniroma2.it, storch@uniroma2.it

Abstract. Although distributional models of word meaning have been widely used in Information Retrieval, achieving an effective representation and generalization schema for words in isolation, the composition of words in phrases or sentences is still a challenging task. Different methods have been proposed to account for syntactic structures by combining words through algebraic operators (e.g. the tensor product) over the vectors that represent the lexical constituents. In this paper, a novel approach to semantic composition based on space projection techniques over the basic geometric lexical representations is proposed. In the geometric perspective pursued here, syntactic bi-grams are projected into so-called Support Subspaces, aimed at emphasizing the semantic features shared by the compound words and at better capturing phrase-specific aspects of the involved lexical meanings. State-of-the-art results are achieved on a well-known benchmark for the phrase similarity task, and the generalization capability of the proposed operators is investigated in a cross-linguistic scenario, i.e. for the English and Italian languages.

1 Introduction

With the rapid development of the World Wide Web and the spread of human-generated contents, Information Retrieval (IR) faces many challenges in discovering and exploiting such rich and huge information resources. Semantic search [3] improves search precision and recall by understanding the user's intent and the contextual meaning of concepts in documents and queries. Semantic search extends the scope of traditional information retrieval paradigms from mere document retrieval to entity and knowledge retrieval, improving conventional IR methods by looking at a different perspective, i.e. the meaning of words. However, the richness of language and its intrinsic relation to the world and to human activities make semantic search a very complex task. In an IR system, a user can express a specific information need with a natural language query like "... buy a car ...". This request can be satisfied by documents expressing the abstract concept of buying something, where the focus of the action is a car. This information can be expressed inside a document collection in many different forms, e.g. the quasi-synonymic expression "... purchase an automobile ...". Relying only on lexical overlap with the original query, a bag-of-words system would instead retrieve different documents, containing expressions such as "... buy a bag ..." or "... drive a car ...". A proper semantic generalization is thus needed in order to derive the correct composition of the target words, i.e. an action like buy and an object like car.

While compositional approaches to language understanding have been largely adopted, semantic tasks are still challenging for research in Natural Language Processing. Traditional logic-based approaches (such as Montague's approach in [17] and [2]) rely on Frege's principle, for which the meaning of a sentence is a function of the meanings of its parts [10]. The resulting theory provides an algebra over discrete propositional symbols to represent the meaning of arbitrarily complex expressions.
Despite being formally well defined, logic-based approaches have limitations in the treatment of ambiguity, vagueness and cognitive aspects intrinsically connected to natural language. On the other hand, distributional models, early introduced by Schütze [21], rely on the Word Space model. Here semantic uncertainty is managed through the statistical analysis of large scale corpora. Linguistic phenomena are then modeled according to a geometrical perspective: points in a high-dimensional space represent semantic concepts, such as words, and can be learned from corpora in such a way that similar, or related, concepts are near each other in the space.

Methods for constructing representations of phrases or sentences through vector composition have recently received wide attention in the literature (e.g. [15, 23]). However, vector-based models typically represent isolated words and ignore grammatical structure [23]. Such models thus have a limited capability to model compositional operations over phrases and sentences. In order to overcome these limitations, a so-called distributional compositional semantics (DCS) model is needed, and its development is still the object of on-going and controversial research (e.g. [5], [11]). A compositional model based on distributional analysis should provide semantic information consistent with the meaning assignments that are typical of human subjects. For example, it should support synonymy and similarity judgments on phrases, rather than only on single words. The objective should be a measure of similarity between quasi-synonymic complex expressions, such as "... buy a car ..." vs. "... purchase an automobile ...". Another typical benefit should be a computational model for entailment, so that the representation of "... buying something ..." should be implied by the expression "... buying a car ..." but not by "... buying time ...". Distributional compositional semantics thus needs: (1) a way to represent lexical vectors u and v, for words u, v, that depends on the phrase (r, u, v) (where r is a syntactic relation, such as verb-object), and (2) a metric for comparing different phrases according to the selected representations u, v. Existing models are still controversial and provide general algebraic operators (such as tensor products) over lexical vectors.

In this paper, we focus on the geometry of latent semantic spaces by proposing a novel distributional model for semantic composition. The aim is to model the semantics of syntactic bigrams as projections into lexically-driven subspaces. Distances in such subspaces (called Support Subspaces) emphasize the role of common features that constrain "in parallel" the interpretation of the involved lexical meanings, and better capture phrase-specific aspects. In the following evaluations, the operators are employed to compose word pairs involved in specific syntactic structures. The resulting compositions are evaluated according to two different perspectives. First, similarity among compositions is evaluated with respect to human annotators' judgments. Then, the operators' generalization capability is measured in order to prove their applicability in complex semantic search systems. Moreover, the robustness of the Support Subspace approach is confirmed in a cross-linguistic scenario, i.e. for the English and Italian languages.
While Section 2 discusses existing methods of compositional distributional semantics, Section 3 presents our model based on support subspaces. Experiments in Section 4 show the beneficial impact of the proposed model and its contribution to semantic search systems. Finally, Section 5 derives the conclusions.

2 Related work

While compositional semantics governs the recursive interpretation of sentences or phrases, traditional vector space models (as in IR [20]) and, mostly, semantic space models, such as LSA ([7, 13]), represent lexical information in metric spaces where individual words are represented according to the distributional analysis of their co-occurrences over a large corpus. Such models are based on the distributional hypothesis, which assumes that words occurring within similar contexts are semantically similar (Harris [12]).

Semantic spaces have been widely used for representing the meaning of words or other lexical entities (e.g. [23]), with successful applications in lexical disambiguation ([22]) or in harvesting thesauri (as in Lin [14]). In this work we will refer to so-called word-based spaces, in which words are represented by probabilistic information about their co-occurrences, computed in a fixed-range window over all sentences. In such models, vector components correspond to the entries f of the vocabulary V (i.e. to features that are individual words). Weights are associated with each component, using different estimators of their correlation. In some works (e.g. [15]) pure co-occurrence counts are adopted as weighting functions f_i, where i = 1, ..., N and N = |V|; in other works (e.g. [18]), statistical functions like the pointwise mutual information between the target word w and the co-occurrences captured in the window are used, i.e.

pmi(w, i) = \log_2 \frac{p(w, f_i)}{p(w) \cdot p(f_i)}

A vector w = (pmi_1, ..., pmi_N) models a word w and is thus built over all the words f_i belonging to the dictionary. When w and f_i never co-occur in any window, their pmi is by default set to 0. Weights of vector components depend on the size of the co-occurrence window and express global statistics over the entire corpus. Larger values of the adopted window size aim to capture topical similarity (as in the document-based models of IR), while smaller sizes (usually between the ±1-3 surrounding words) lead to representations better suited for paradigmatic similarities between word vectors w. Similarity between vectors w_1 and w_2 is modeled as the normalized scalar product, i.e. the cosine similarity

\frac{\langle w_1, w_2 \rangle}{\|w_1\| \, \|w_2\|}

which expresses topical or paradigmatic similarity according to the different representations (e.g. window sizes). Notice that dimensionality reduction methods, such as LSA [7, 13], are also applied in some studies to capture second-order dependencies between features f, i.e. applying semantic smoothing to possibly sparse input data. Applications of an LSA-based representation to Frame Induction and Semantic Role Labeling are presented in [19] and [6], respectively.

The main limitation of distributional models of lexical semantics is their non-compositional nature: they are based on statistics related to the occurrences of individual words in the corpus. In such models, the semantics of the topological similarity functions is thus defined only for the comparison between individual words. That is the reason why distributional methods cannot compute the meanings of phrases (and sentences) as effectively as they do over individual words.
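As a rough illustration of the word-based space described above, the sketch below (Python, with hypothetical data structures of ours: cooc, word_count, feat_count and total are not from the paper) builds a PMI-weighted vector for a target word from window-based co-occurrence counts and compares two such vectors by cosine similarity.

```python
import math

def pmi_vector(word, cooc, word_count, feat_count, total):
    """Build the PMI-weighted vector of `word` over all features.

    cooc[(w, f)]  : co-occurrence count of target w and feature f in a small window
    word_count[w] : total occurrences of w as a target word
    feat_count[f] : total occurrences of f as a feature word
    total         : total number of (target, feature) co-occurrence events
    Features never seen with `word` are simply left out, i.e. weighted 0 as above.
    """
    vec = {}
    for (w, f), c in cooc.items():
        if w != word or c == 0:
            continue
        p_wf = c / total
        p_w = word_count[w] / total
        p_f = feat_count[f] / total
        vec[f] = math.log2(p_wf / (p_w * p_f))
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts feature -> weight)."""
    dot = sum(u[f] * v.get(f, 0.0) for f in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```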
Distributional methods have recently been extended to better account for compositionality, in the so-called distributional compositional semantics (DCS) approaches. Mitchell and Lapata in [15] follow Foltz [9] and assume that the contribution of the syntactic structure can be ignored, while the meaning of a phrase is simply the commutative sum of the meanings of its constituent words. More formally, [15] defines the composition p_\circ = u \circ v of vectors u and v through an additive class of composition functions expressed by:

p_+ = u + v    (1)

This perspective clearly leads to a variety of efficient yet shallow models of compositional semantics, compared in [15]. For example, pointwise multiplication is defined by the multiplicative function:

p_\cdot = u \odot v    (2)

where the symbol \odot represents the multiplication of the corresponding components, i.e. p_i = u_i \cdot v_i. Point-wise multiplication seems to best correspond to the intended effects of syntactic interaction, as the experiments in [15] demonstrate. In [8], the concept of a structured vector space is introduced, where each word is associated with a set of vectors corresponding to different syntactic dependencies. Every word is thus expressed by a tensor, and tensor operations are imposed. The main differences among these studies lie in (1) the selected lexical vector representation (some authors do not even commit to any representation, but generically refer to any lexical vector, as in [11]) and (2) the adopted compositional algebra, i.e. the system of operators defined over such vectors. Generally, the proposed operators do not depend on the involved lexical items, and a general purpose algebra is adopted. Since compositional structures are highly lexicalized, and the same syntactic relation can trigger very different semantic relations depending on the involved words, a proposal that makes the compositionality operators dependent on the individual lexical vectors is hereafter discussed.

3 A quantitative model for compositionality

In order to determine the semantic analogies and differences between two phrases, such as "... buy a car ..." and "... buy time ...", a distributional compositional model is employed as follows. The involved lexical items are buy, car and time, while their corresponding vector representations are denoted by w_buy, w_car and w_time. The major result of most studies on DCS is the definition of a function \circ that associates with w_buy and w_car a new vector w_{buy car} = w_buy \circ w_car. We consider this approach misleading, since vector components in the word space are tied to the syntactic nature of the composed words, and the new vector w_{buy car} should not have the same type as the original vectors. Notice also that the components of w_buy and w_car express all their contexts, i.e. the interpretations, and thus the senses, of buy and car in the corpus. Algebraic operations are thus open to misleading contributions, brought by non-null feature scores of buy_i vs. car_j (i ≠ j) that may correspond to senses of buy and car that are not related to the specific phrase "buy a car". On the contrary, in a composition such as the verb-object pair (buy, car), the word car influences the interpretation of the verb buy and vice versa. The model proposed here is based on the assumption that this influence can be expressed via a projection into a subspace, i.e. a subset of the original features f_i. A projection is a mapping (a selection function) over the set of all features.
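For concreteness, a minimal sketch of the two shallow composition functions of Eqs. 1 and 2 over dense numpy vectors; the toy word vectors below are made-up examples, not data from the paper.

```python
import numpy as np

def additive(u, v):
    """Additive composition (Eq. 1): p_+ = u + v."""
    return u + v

def multiplicative(u, v):
    """Point-wise multiplicative composition (Eq. 2): p_i = u_i * v_i."""
    return u * v

# toy usage with hypothetical 4-dimensional word vectors
u = np.array([0.2, 0.0, 0.7, 0.1])   # e.g. "buy"
v = np.array([0.3, 0.5, 0.6, 0.0])   # e.g. "car"
print(additive(u, v), multiplicative(u, v))
```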
A subspace generated by a projection function Π local to the (buy, car) phrase can be found such that only the features specific to the phrase meaning are selected and the irrelevant ones are neglected. The resulting subspace has to preserve the compositional semantics of the phrase and is called the support subspace of the underlying word pair.

Consider the bigram composed of the words buy and car and their vectorial representations in a co-occurrence N-dimensional Word Space. Table 1 reports the k = 10 features with the highest contributions to the point-wise product for the pairs (buy, car) and (buy, time). The support space thus selects the most important features for both words, e.g. for buy::V and car::N. Notice that this captures the conjunctive nature of the scalar product, whose contributions come from features with non-zero scores in both vectors. It is clear that the two pairs give rise to different support subspaces: the main components related to buy car refer mostly to the automobile commerce area, unlike the ones related to buy time, which mostly refer to time wasting or saving. Similarity judgments about a pair can thus be better computed within its support subspace.

  Buy-Car          Buy-Time
  cheap::Adj       consume::V
  insurance::N     enough::Adj
  rent::V          waste::V
  lease::V         save::In
  dealer::N        permit::N
  motorcycle::N    stressful::Adj
  hire::V          spare::Adj
  auto::N          save::V
  california::Adj  warner::N
  tesco::N         expensive::Adj

Table 1. Features corresponding to dimensions in the k=10 dimensional support space of the bigrams buy car and buy time

More formally, the k-dimensional support subspace for a word pair (u, v) (with k \ll N) is the subspace spanned by the subset of n \leq k indexes I^k(u, v) = \{i_1, ..., i_n\} for which \sum_{t=1}^{n} u_{i_t} \cdot v_{i_t} is maximal. Given two pairs, the similarity between syntactically equivalent words (e.g. nouns with nouns, verbs with verbs) is measured in the support subspace derived by applying a specific projection function. The compositional similarity between buy car and the other pair (e.g. buy time) is thus estimated by (1) immersing w_buy and w_time in the selected "... buy car ..." support subspace and (2) estimating the similarity between corresponding arguments of the pairs locally in that subspace. Therefore the similarity between syntactically equivalent words (e.g. car with time) is measured within this new subspace. Given a pair (u, v), a unique matrix M^k_{uv} = (m^k_{uv})_{ij} is defined for a given projection Π^k(u, v) into the k-dimensional support space of the pair (u, v), according to the following definition:

(m^k_{uv})_{ij} = \begin{cases} 1 & \text{iff } i = j \in I^k(u, v) \\ 0 & \text{otherwise} \end{cases}    (3)

The vector ũ projected into the support subspace can thus be estimated through the following matrix operation:

ũ = Π^k(u, v)[u] = M^k_{uv} u    (4)

A special case of the projection matrix is obtained when no limitation k is imposed on the dimension and all the positive addends in the scalar product are taken. Notice also that two pairs p_1 = (u, v) and p_2 = (u', v') give rise to two different projections, denoted by M^k_1 and M^k_2 and defined as:

(Left projection) Π^k_1 = Π^k(u, v)    (Right projection) Π^k_2 = Π^k(u', v')    (5)

It is also possible to define a unique symmetric projection Π^k_{12}, corresponding to the combined matrix M^k_{12}, as follows:

M^k_{12} = (M^k_1 + M^k_2) - (M^k_1 M^k_2)    (6)

where the mutual components that satisfy Eq. 3 are employed in M^k_{12}. As Π^k_1 is the projection into the support subspace of the pair p_1, it is possible to immerse the other pair p_2 by applying Eq. 4. This results in the two vectors M^k_1 u' and M^k_1 v'. It follows that a compositional similarity judgment between two phrases over the support subspace of the first pair can be expressed as:

\Phi^{(\circ)}_{p_1}(p_1, p_2) = \Phi^{(\circ)}_1(p_1, p_2) = \frac{\langle M^k_1 u, M^k_1 u' \rangle}{\|M^k_1 u\| \, \|M^k_1 u'\|} \circ \frac{\langle M^k_1 v, M^k_1 v' \rangle}{\|M^k_1 v\| \, \|M^k_1 v'\|}    (7)

where first the cosine similarities between syntactically correlated vectors in the selected support subspace are computed, and then a composition function \circ, such as the sum or the product, is applied. The compositional function over the support subspace evoked by the other pair p_2 is correspondingly denoted by \Phi^{(\circ)}_2(p_1, p_2). A symmetric composition function can thus be obtained as a combination of \Phi^{(\circ)}_1(p_1, p_2) and \Phi^{(\circ)}_2(p_1, p_2):

\Phi^{(\diamond)}_{12}(p_1, p_2) = \Phi^{(\circ)}_1(p_1, p_2) \diamond \Phi^{(\circ)}_2(p_1, p_2)    (8)

where the composition function \diamond (again the sum or the product) between the similarities over the left and right support subspaces is applied. Notice how the left and right composition operator \circ may differ from the overall composition operator \diamond. More details are discussed in [1].
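As a rough illustration of Eqs. 3-8, the following sketch (a minimal Python/numpy implementation under the assumption of dense word vectors; all function names are ours, not from the paper) selects the support indexes of a pair by the largest point-wise products, builds the corresponding diagonal projection matrix, and computes the left and symmetric compositional similarities, here instantiating both the inner and the outer operator as sums.

```python
import numpy as np

def projection_matrix(u, v, k):
    """M^k_uv of Eq. 3: diagonal 0/1 matrix keeping the k dimensions with the
    largest point-wise products u_i * v_i (the index set I^k(u, v))."""
    idx = np.argsort(u * v)[::-1][:k]
    m = np.zeros((len(u), len(u)))
    m[idx, idx] = 1.0
    return m

def symmetric_projection(m1, m2):
    """M^k_12 of Eq. 6: the union of the two supports (both matrices are diagonal 0/1)."""
    return (m1 + m2) - (m1 @ m2)

def cos(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b) / (na * nb) if na and nb else 0.0

def phi_left(p1, p2, k, combine=lambda a, b: a + b):
    """Eq. 7: immerse p2 = (u', v') in the support subspace of p1 = (u, v) (Eq. 4)
    and combine the two cosine similarities with the inner operator (here a sum)."""
    (u1, v1), (u2, v2) = p1, p2
    m1 = projection_matrix(u1, v1, k)
    return combine(cos(m1 @ u1, m1 @ u2), cos(m1 @ v1, m1 @ v2))

def phi_symmetric(p1, p2, k, outer=lambda a, b: a + b):
    """Eq. 8: combine the left and right similarities with the outer operator."""
    return outer(phi_left(p1, p2, k), phi_left(p2, p1, k))
```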
4 Experimental Evaluation

This experimental evaluation aims to estimate the effectiveness of the proposed class of projection-based methods in capturing similarity judgments over phrases and syntactic structures. In particular, a first evaluation is carried out to measure the correlation of the operator outcomes with judgments provided by human annotators. The generalization capability of the operators is measured in the second evaluation, in order to prove their applicability in complex semantic search systems. Moreover, the latter experiments are carried out in a cross-language setting, i.e. on English and Italian datasets.

  Type  First Pair         Second Pair           Rate
  VO    support offer      provide help          7
  VO    use knowledge      exercise influence    5
  VO    achieve end        close eye             1
  AdjN  old person         right hand            1
  AdjN  vast amount        large quantity        7
  AdjN  economic problem   practical difficulty  3
  NN    tax charge         interest rate         7
  NN    tax credit         wage increase         5
  NN    bedroom window     education officer     1

Table 2. Examples from the Mitchell and Lapata dataset for the three syntactic relations verb-object (VO), adjective-noun (AdjN) and noun-noun (NN)

Two different word spaces are derived for the two languages. For English, the word space is derived from ukWaC [4], a web-based corpus consisting of about 2 billion tokens. For Italian, the Italian Wikipedia corpus has been employed; it consists of about 200 million tokens and more than 10 million sentences. (The latter corpus is developed by the WaCky community and is available on the WaCky project web page at http://medialab.di.unipi.it/Project/QA/wikiCoNLL.bz2.) The space construction proceeds from an adjacency matrix M, on which Singular Value Decomposition ([7]) is then applied. Part-of-speech tagged words have been collected from the corpus to reduce data sparseness. All target words tw occurring more than 200 times are then selected, i.e. more than 50,000 candidate features. Each column i of M represents a (feature) word in the corpus. Rows model the target words tw, i.e. they contain the pmi values for the individual features f_i, as captured in a window of size ±3 around tw. The most frequent 20,000 left and right features f_i are selected, so that M expresses 40,000 contexts. SVD is here applied to limit the dimensionality to N = 100.
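To make the space construction concrete, a short sketch of the dimensionality reduction step, assuming a (targets x contexts) PMI matrix M is already available (e.g. built along the lines of the PMI sketch in Section 2); the function name and the use of a truncated SVD via scipy are our illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.sparse.linalg import svds

def reduce_space(M, n_dim=100):
    """Truncated SVD of the (targets x contexts) PMI matrix:
    returns one n_dim-dimensional row vector per target word."""
    # svds returns the n_dim largest singular triplets (in no particular order)
    U, S, Vt = svds(M, k=n_dim)
    order = np.argsort(S)[::-1]
    # latent word vectors, scaled by the singular values as in LSA
    return U[:, order] * S[order]

# hypothetical usage: M_pmi is a scipy.sparse matrix holding the pmi weights
# word_vectors = reduce_space(M_pmi, n_dim=100)
```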
4.1 Experiment I

The first evaluation is carried out over the dataset proposed by [16], which is part of the GEMS 2011 Shared Evaluation. It consists of a list of 5,833 adjective-noun (AdjN), verb-object (VO) or noun-noun (NN) pairs, rated with scores ranging from 1 to 7. Examples of pairs and scores are shown in Table 2. The correlation of the similarity judgments output by a DCS model with the human judgments is computed using Spearman's ρ, a non-parametric measure of statistical dependence between two variables, as proposed by [15].

  Model                                                               AdjN  NN   VO
  Mitchell&Lapata (Word Space SVD)  Additive                          .69   .70  .64
  Mitchell&Lapata (Word Space SVD)  Multiplicative                    .38   .43  .42
  Support Subspace [1]              Φ^(+), Π^k_12 (k=30)              .70   .71  .63
  Support Subspace [1]              Φ^(·)_12, Φ^(+)_i, Π^k_i (k=40)   .68   .68  .64
  Agreement among Human Subjects    Max                               .88   .92  .88
  Agreement among Human Subjects    Avg                               .72   .72  .71

Table 3. Spearman's ρ correlation coefficients across the Mitchell and Lapata models and the projection-based models proposed in Section 3. Word Space refers to the source spaces used as input to the LSA decomposition model.

Table 3 reports the Mitchell and Lapata (M&L) performances in the first two rows, while the last two rows show the maximum and average inter-annotator agreement scores for the three categories, derived through a leave-one-out resampling method. The middle rows show Spearman's correlations for the support subspace models discussed in [1] that perform best on the distributional composition task. Notice that different configurations of the models described in Section 3 are used. For example, the system denoted as Φ^(·)_12, Φ^(+)_i, Π^k_i (k=40) corresponds to a multiplicative symmetric composition function Φ^(·)_12 (as in Eq. 8) based on left and right additive compositions Φ^(+)_i (i = 1, 2, as in Eq. 7), derived through a projection Π^k_i into the support space limited to the first k = 40 components of each pair (as in Eq. 5). The specific operator denoted by Φ^(+), Π^k_12 (k=30) achieves the best performance on two out of three syntactic patterns (i.e. AdjN and NN) and is close to the best figures for VO.

The experimental evaluation shows that the best performances are achieved by the proposed projection-based operators. Notice that the distributional composition between verbs and objects is a very tricky task, and results here are in line with the additive model. Globally, the results of our models are close to the average agreement among human subjects, the latter representing a sort of upper bound for the underlying task. It seems that latent topics (as extracted through SVD from sentence and word spaces), as well as the projection operators defined by support subspaces, provide a suitable comprehensive paradigm for compositionality. They seem to capture compositional similarity judgments that are significantly close to the human ones. Notice that different settings of the projection operations can influence the performances. A more exhaustive study of the possible settings is presented in [1].
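As a concrete illustration of the evaluation protocol, Spearman's ρ between a model's similarity scores and the human ratings can be computed as below; the two score lists are made-up examples, not data from the paper.

```python
from scipy.stats import spearmanr

# hypothetical similarity scores assigned by a DCS model and by human subjects
# to the same list of (first pair, second pair) items, as in Table 2
model_scores = [0.81, 0.63, 0.12, 0.18, 0.77, 0.35, 0.70, 0.52, 0.09]
human_ratings = [7, 5, 1, 1, 7, 3, 7, 5, 1]

rho, p_value = spearmanr(model_scores, human_ratings)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```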
4.2 Experiment II

In this second evaluation, the generalization capability of the employed operators is investigated. A verb (e.g. perform) can be more or less semantically close to another verb (e.g. solve, or produce) depending on the context in which it appears. The verb-object (VO) composition specifies the verb's meaning by expressing one of its selectional preferences, i.e. its object. In this scenario, we expect a pair such as perform task to be more similar to solve issue, as they both reflect an abstract cognitive action, than to a pair like produce car, i.e. a concrete production. This kind of generalization capability is crucial for effectively using this class of operators in a QA scenario, by enabling the ranking of results according to complex representations of the question. Moreover, both the English and Italian languages can be considered, to demonstrate the impact in a cross-language setting.

Table 4 shows a manually developed dataset. It consists of 24 VO word pairs in English and Italian, divided into 3 different semantic classes: Cognitive, Ingest Liquid and Fabricate.

  Semantic Class  English                Italian
  Cognitive       perform task           svolgere compito
                  solve issue            risolvere questione
                  handle problem         gestire problema
                  use method             applicare metodo
                  suggest idea           suggerire idea
                  determine solution     trovare soluzione
                  spread knowledge       divulgare conoscenza
                  start argument         iniziare ragionamento
  Ingest Liquid   drink water            bere acqua
                  ingest syrup           ingerire sciroppo
                  pour beer              versare birra
                  swallow saliva         inghiottire saliva
                  assume alcohol         assumere alcool
                  taste wine             assaggiare vino
                  sip liquor             assaporare liquore
                  take coffee            prendere caffè
  Fabricate       produce car            produrre auto
                  complete construction  completare costruzione
                  fabricate toy          fabbricare giocattolo
                  build tower            edificare torre
                  assemble device        assemblare dispositivo
                  construct building     costruire edificio
                  manufacture product    realizzare prodotto
                  create artwork         creare opera

Table 4. Cross-linguistic dataset

This evaluation aims to measure how well the proposed compositional operators group together semantically related word pairs, i.e. those belonging to the same class, and separate the unrelated pairs. Figure 1 shows the application of two models, the Additive (Eq. 1) and the Support Subspace (Eq. 8) ones, which achieve the best results in the previous experiment. The two languages are reported in different rows. The similarity distribution between the geometric representations of the verbs alone, with no composition, has been investigated as a baseline. For each language, the similarity distribution among the 552 possible verb pairs is estimated, and the two distributions of the intra-class and inter-class pairs are independently plotted. In order to summarize them, a Normal distribution N(µ, σ²) of mean µ and variance σ² is employed. Each point represents the percentage p(x) of pairs in a group that have a given similarity value equal to x. Within a given class, the VO-VO pairs of a DCS operator are expected to increase this probability with respect to the baseline V-V pairs of the same set; vice versa for pairs belonging to different classes, i.e. inter-class pairs. The distributions for the baseline control set (i.e. Verbs Only, V-V) are always depicted by dotted lines, while the DCS operators are shown with continuous lines. Notice that the overlap between the curves of the intra-class and inter-class pairs corresponds to the amount of ambiguity in deciding whether a pair is in the same class. It is the error probability, i.e. the percentage of cases of one group that by chance appear to have higher probability in the other group. Although the actions described by the different classes are very different, e.g. Ingest Liquid vs. Fabricate, most verbs are ambiguous: contextual information is expected to enable the correct decision. For example, although the class Ingest Liquid is clearly separated from the others, a verb like assume could well be classified in the Cognitive class, as in assume a position.
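As a rough sketch of the ambiguity measure used below (one plausible way to compute it, not necessarily the paper's exact procedure), the intra-class and inter-class similarity scores can each be fitted with a Normal distribution and the overlap of the two densities estimated numerically; the score arrays are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def overlap_area(scores_intra, scores_inter, grid_points=2000):
    """Approximate overlap between two fitted Gaussians, i.e. the error probability."""
    mu1, sd1 = np.mean(scores_intra), np.std(scores_intra)
    mu2, sd2 = np.mean(scores_inter), np.std(scores_inter)
    lo = min(mu1 - 4 * sd1, mu2 - 4 * sd2)
    hi = max(mu1 + 4 * sd1, mu2 + 4 * sd2)
    x = np.linspace(lo, hi, grid_points)
    # the shared area under the two densities is the probability of confusing the groups
    return float(np.trapz(np.minimum(norm.pdf(x, mu1, sd1), norm.pdf(x, mu2, sd2)), x))
```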
[Figure 1 appears here in the original.]

Fig. 1. Cross-linguistic Gaussian distributions of intra-class (red) and inter-class (green) pairs for the proposed operators (continuous lines) with respect to the verbs-only baseline (dashed lines). Panels: (a) English, Additive; (b) English, Support Subspace; (c) Italian, Additive; (d) Italian, Support Subspace.

The outcome of the experiment is that the DCS operators are always able to increase the gap in average similarity between the intra-class and inter-class pairs. The geometrical representation of the verb is consistently changed, as most similarity distributions suggest. The compositional operators are able to decrease the overlap between the different distributions, i.e. to reduce the ambiguity. Figure 1 (a) and (c) report the distributions of the M&L additive operator, which achieves an impressive ambiguity reduction, i.e. the overlap between the curves is drastically reduced. This phenomenon is further increased when the Support Subspace operator is employed, as shown in Figure 1 (b) and (d): notice how the mean value of the distribution of semantically related word pairs is significantly increased for both languages.

The reduction in error probability can be computed against the control groups. It is the decrease of the error probability of a DCS operator relative to the same estimate for the control (i.e. V-V) group, and it is a natural estimator of the generalization capability of the involved operators. Table 5 reports the intersection area for all the models and the relative decrease in error probability. For English, the ambiguity reduction of the Support Subspace operator is 91.0% with respect to the control set. This is comparable with the additive operator result, i.e. 92.3%, and confirms the findings of the previous experiment, where the difference between these operators is negligible. For Italian, the generalization capability of the Support Subspace operator is more stable, as its error reduction is 62.9%, against 54.2% for the additive model.

                    English                     Italian
  Model             Prob. of Error  Ambiguity   Prob. of Error  Ambiguity
                                    Decrease                    Decrease
  VerbOnly          .401            -           .222            -
  Additive          .030            92.3%       .101            54.2%
  SupportSubspace   .036            91.0%       .082            62.9%

Table 5. Ambiguity reduction analysis

5 Conclusions

In this paper, a distributional compositional semantic model based on space projections guided by syntagmatically related lexical pairs is defined. Syntactic bi-grams are projected into so-called Support Subspaces and compositional similarity scores are correspondingly derived.
This represents a novel perspective on compositional models over vector representations with respect to the shallow vector operators (e.g. additive or multiplicative operations) proposed in the literature, e.g. in [16]. The presented approach focuses on selecting the most important components for a specific word pair involved in a syntactic relation, in order to obtain a more accurate estimator of their similarity. The proposed method has been evaluated on the well-known dataset of [16], achieving results close to the average human inter-annotator agreement scores. A first study of the applicability of such compositional models in typical IR systems was carried out. The operators' generalization capability was measured, proving that the compositional operators can effectively separate phrase structures into different semantic clusters. The robustness of such operators has also been confirmed in a cross-linguistic scenario, i.e. for the English and Italian languages. Future work on other compositional prediction tasks (e.g. selectional preference modeling) and on different datasets will be carried out to better assess and generalize the presented results.

References

1. Annesi, P., Storch, V., Basili, R.: Space projections as distributional models for semantic composition (2012), submitted for publication
2. Coecke, B., Sadrzadeh, M., Clark, S.: Mathematical foundations for a compositional distributional model of meaning. Lambek Festschrift, Linguistic Analysis 36 (2010), http://arxiv.org/submit/10256/preview
3. Baeza-Yates, R., Ciaramita, M., Mika, P., Zaragoza, H.: Towards semantic search. Natural Language and Information Systems, pp. 4–11 (2008)
4. Baroni, M., Bernardini, S., Ferraresi, A., Zanchetta, E.: The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation 43(3), 209–226 (2009)
5. Baroni, M., Zamparelli, R.: Nouns are vectors, adjectives are matrices: representing adjective-noun constructions in semantic space. In: Proceedings of EMNLP 2010, pp. 1183–1193. Stroudsburg, PA, USA (2010)
6. Croce, D., Giannone, C., Annesi, P., Basili, R.: Towards open-domain semantic role labeling. In: ACL, pp. 237–246 (2010)
7. Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. JASIS 41(6), 391–407 (1990)
8. Erk, K., Padó, S.: A structured vector space model for word meaning in context (2008)
9. Foltz, P.W., Kintsch, W., Landauer, T.K.: The measurement of textual coherence with latent semantic analysis (1998)
10. Frege, G.: Über Sinn und Bedeutung. Zeitschrift für Philosophie und philosophische Kritik 100, 25–50; translated as 'On Sense and Reference' by Max Black
11. Grefenstette, E., Sadrzadeh, M.: Experimental support for a categorical compositional distributional model of meaning. CoRR abs/1106.4058 (2011)
12. Harris, Z.S.: Mathematical Structures of Language. Wiley, New York, NY, USA (1968)
13. Landauer, T.K., Dumais, S.T.: A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, pp. 211–240 (1997)
14. Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of COLING-ACL, Montreal, Canada (1998)
15. Mitchell, J., Lapata, M.: Vector-based models of semantic composition. In: Proceedings of ACL-08: HLT, pp. 236–244 (2008)
16. Mitchell, J., Lapata, M.: Composition in distributional models of semantics. Cognitive Science 34(8), 1388–1429 (2010)
17. Montague, R.: Formal Philosophy: Selected Papers of Richard Montague. Yale University Press (1974)
18. Pantel, P., Lin, D.: Document clustering with committees. In: SIGIR-02, pp. 199–206 (2002)
19. Pennacchiotti, M., Cao, D.D., Basili, R., Croce, D., Roth, M.: Automatic induction of FrameNet lexical units. In: EMNLP, pp. 457–465 (2008)
20. Salton, G., Wong, A., Yang, C.: A vector space model for automatic indexing. Communications of the ACM 18, 613–620 (1975)
21. Schütze, H.: Word space. In: Hanson, S.J., Cowan, J.D., Giles, C.L. (eds.) NIPS 5, pp. 895–902. Morgan Kaufmann Publishers, San Mateo, CA (1993)
22. Schütze, H.: Automatic word sense discrimination. Computational Linguistics 24, 97–124 (1998)
23. Turney, P.D., Pantel, P.: From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37, 141 (2010), doi:10.1613/jair.2934