Grammatical Feature Engineering for fine-grained IR tasks

Danilo Croce and Roberto Basili
Department of Enterprise Engineering, University of Roma, Tor Vergata
{croce,basili}@info.uniroma2.it

Abstract. Information Retrieval tasks nowadays include more and more complex information in order to face contemporary challenges such as Opinion Mining (OM) or Question Answering (QA). These are examples of tasks where complex linguistic information is required for reasonable performance on realistic data sets. As natural language learning is usually applied to these tasks, rich structures, such as parse trees, are critical, as they require complex resources and accurate pre-processing. In this paper, we show how good quality language learning methods can be applied to the above tasks by using grammatical representations simpler than parse trees. These features are shown to achieve state-of-the-art accuracy in different IR tasks, such as OM and QA.

1 Syntactic modeling of linguistic features in Semantic Tasks

Information Retrieval nowadays faces contemporary challenges, such as Sentiment Analysis (SA) or Question Answering (QA), that are tied to complex and fine-grained linguistic information. The traditional view in IR, which represents the meaning of documents just according to the words that occur in them, is not directly applicable. Statistical models, such as the vector-space model or variants of the probabilistic model, that express documents and queries as Bags-of-Words (BOW) [1] are too poor. Even though fully lexicalized models are well established, in recent years syntactic and semantic structures expressing richer linguistic information have become essential in complex IR tasks, such as Question Classification [21] and Passage Ranking [3] in Question Answering, or Opinion Mining (OM) [12] in Sentiment Analysis. The major problem here is that fine-grained phenomena are targeted, and lexical information alone is not sufficient.

The capabilities of BOW retrieval models do not always provide a robust solution to these real retrieval needs. For example, in a QA system a BOW IR engine retrieves documents matching a query, but the QA system actually needs documents that contain answers. Question analysis is thus crucial for the QA system to model the user's information needs and to retrieve a proper answer. An answer is available only when the linguistic and semantic constraints imposed by the question are satisfied, thus requiring an effective selection of answer-bearing passages.

Language learning systems make it possible to generalize linguistic observations into rules and patterns, as statistical models of higher-level semantic inferences. Statistical learning methods assume that lexical or grammatical observations are useful hints for modeling different semantic inferences, such as topical document classification, predicate and role recognition in sentences, or question classification in Question Answering. Lexical features here include lemmas, multiword expressions or Named Entities that can be directly observed in the texts. Features are then generalized into the predictive components of the final model, induced from the training examples. Lexical information allows different words to provide different contributions, but it usually neglects other crucial linguistic properties, such as word ordering.
Information about the syntactic structure of a sentence can thus be exploited: symbolic expressions derived from the parse trees of training examples are used as features for language learning systems. These features denote the position of and the relationship between words, abstracting away from the irrelevant differences among the trees that may realize them. For example, in a declarative sentence (such as in a S←NP VP structure), the relationship between a verb and the grammatical subject (NP) preceding its verb phrase (VP) is literally translated into the feature VB↑VP↑S↓NP, where arrows indicate upward or downward movements through the tree. Linear kernels over the resulting Parse Tree Path features are employed in NLP tasks such as Semantic Role Labeling [14] or Opinion Mining [22]. This idea is further expanded in tree kernels, introduced by [5], which model the similarity between training examples as a function of the subtrees shared by their corresponding parses. Tree kernels have been successfully applied to different tasks, ranging from parsing [5] to semantic role labeling [19]. Tree kernels are known to determine a richer grammatical representation of the targeted examples and provide an implicit method for robust feature engineering.

However, the adoption of grammatical features and tree kernels is still affected by significant drawbacks. First, strict requirements exist on the size of the training data set, as very high-dimensional spaces are generated, whose data sparseness can be prohibitive. Usually, the application of exact learning algorithms gives rise to complex training processes whose convergence is quite slow. Although specific forms of optimization have been proposed to limit this inherent complexity (e.g. [18]), tree kernels do not scale well over very large training data sets. Finally, it must be noted that most methods extracting grammatical features from parse trees are strongly biased by parsing errors.

We explore here a possible solution to the above problems through the adoption of shallow but more consistent grammatical features that avoid the use of a full parser in semantic tasks. Parsing accuracy varies considerably across corpora, and parsing is often poorly effective for natural languages or application domains where limited resources are available, or where the syntactic structure of the test instances is very different from the training material. In particular, [7] investigates the accuracy loss of well-known syntactic parsers applied to micro-blogging datasets, observing a drastic drop in performance when moving from the in-domain test set to a new Twitter dataset. Avoiding the adoption of full parsing obviously increases the number and nature of possible uses of language technologies in a variety of complex NLP applications. In IR, part-of-speech information has generally been used for stemming, generating stop-word lists, and identifying pertinent terms or phrases in documents and/or in queries. Generally, state-of-the-art IR systems tend to benefit from the adoption of parts of speech to index or retrieve information [24].

The open research questions are: which shallow grammatical representation is suitable to support the learning of fine-grained semantic models? Which grammatical generalizations can be usefully achieved over shallow syntactic representations for sentence-based inferences?
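Before turning to the proposed representation, the Parse Tree Path features recalled above can be made concrete with a short sketch. The following Python fragment is our own illustration (it is not code from the cited systems, and the function name and the ASCII rendering of the ↑/↓ arrows are our choices); it computes the path between two constituents of an NLTK tree:

```python
from nltk import Tree

def tree_path(tree, from_pos, to_pos):
    """Parse Tree Path between two constituents, given as NLTK tree
    positions; '^' marks upward and '!' downward steps in the path."""
    # The lowest common ancestor is the longest common prefix of the positions.
    i = 0
    while i < min(len(from_pos), len(to_pos)) and from_pos[i] == to_pos[i]:
        i += 1
    # Climb from the start node up to (and including) the common ancestor ...
    ups = [tree[from_pos[:d]].label() for d in range(len(from_pos), i - 1, -1)]
    # ... then descend to the target node.
    downs = [tree[to_pos[:d]].label() for d in range(i + 1, len(to_pos) + 1)]
    return "^".join(ups) + "!" + "!".join(downs)

t = Tree.fromstring("(S (NP (NNP Cognac)) (VP (VBZ is) (NP (DT a) (NN brandy))))")
print(tree_path(t, (1, 0), (0,)))  # -> VBZ^VP^S!NP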
In the rest of this work, we show how embedding shallow grammatical information in a sentence representation, as a special case of enriched lexical information, produces useful generalizations in standard machine learning settings. Empirical findings in support of this thesis are discussed against two complex sentence-based semantic tasks, i.e. question classification and sentiment analysis in micro-blogging.

2 Shallow Parsing and Grammatical Feature engineering

Grammatical feature engineering is required because lexical information alone is, in general, not sufficient to characterize the linguistic generalizations useful for fine-grained semantic inferences. For example, sentence (3) is the appropriate answer for question (1), although both sentences (2) and (3) are reasonable candidates.

What French province is Cognac produced in? (1)
The grapes which produce the Cognac grow in the province and the French government ... (2)
Cognac is a brandy produced in Poitou-Charentes. (3)

Suppose we use a lexical overlap rule for a Question Answering (QA) task: given the overlapping terms (sentence (2) shares five terms with question (1), while (3) shares only four), it would return the wrong answer (2). A simple lexical overlap model is too simplistic, as the syntactic information characterizing the individual sentences (1) and (3) is here necessary. Syntactic features provide more information to estimate the similarity between the question and the candidate answers, as generally explored by tree kernels in Answer Classification/Re-ranking [20]. The parse tree in Figure 1 corresponds to sentence (3) and represents:

– lexical information, through its terminal nodes (e.g. words such as Cognac, is, ...)
– coarse-grained grammatical information, through the POS tags characterizing pre-terminal nodes (e.g. NNP or VBZ)
– fine-grained grammatical information, as subtrees correspond to the production rules of the underlying context-free grammar (CFG). Examples of the CFG rules involved in Figure 1 are: S → NP VP, VP → VBZ NP, NP → NP VP, or NP → DT NN.

Stochastic context-free grammars (e.g. [4]) are generative models for parse trees, seen as complex joint events whose overall probability depends on the individual CFG rules (i.e., subtrees) as well as on lexical information. Our aim here is to acquire these rules implicitly, as a side effect of learning the targeted semantic inferences. Specific features can in fact be designed to surrogate the syntactic structures of the parse tree implicitly. Observable POS tag sequences correspond to subtrees and can be considered their shallow counterpart: they express special grammatical properties linearly, in analogy with the Parse Tree Paths of [9]. In other words, subtrees can be artificially replaced by introducing POS tag sequences (or POS n-grams) instead of parse tree fragments. The idea is that the syntactic structure of a sentence can be surrogated by its POS n-grams, instead of the set of possible syntactic tree fragments as used by tree kernels. For example, the partial tree expressed by VP → VBN PP in Fig. 1 can be represented through the pseudo-token VBN-IN-NNP.

Fig. 1. Example of the parse tree associated with sentence (3)
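The correspondence between subtrees and CFG productions can also be made explicit in code. The following sketch assumes NLTK is available; the bracketed string is our transcription of the tree in Figure 1:

```python
from nltk import Tree

# Our transcription of the parse tree shown in Figure 1.
tree = Tree.fromstring("""
  (S (NP (NNP Cognac))
     (VP (VBZ is)
         (NP (NP (DT a) (NN brandy))
             (VP (VBN produced)
                 (PP (IN in) (NP (NNP Poitou-Charentes)))))))""")

# Every internal node yields one production of the underlying CFG.
for prod in tree.productions():
    if prod.is_nonlexical():      # skip lexical rules such as NNP -> 'Cognac'
        print(prod)               # S -> NP VP, VP -> VBZ NP, NP -> DT NN, ...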
Lexicalized features (i.e., true words) as well as shallow syntactic information (i.e., the POS n-grams) are thus made available as flat features, thus constraining the capacity of the underlying learning machine. A sentence s of length |s| is represented as a set of words (in a bag-of-words fashion), extended by the pseudo-tokens corresponding to the POS tag sequences of length up to n (POS n-grams). Given the word sequence s = {w1, ..., w|s|}, whose corresponding parts of speech are {pos1, ..., pos|s|}, the lexical part of the representation is the set of pairs {(w1.pos1), ..., (w|s|.pos|s|)}, where each lemmatized word is coupled with its POS tag. Moreover, in order to capture syntactic structures of interest, POS tags are also mapped into pseudo-tokens expressing their sequences (i.e., POS n-grams). Given n as the maximal size of the extracted sequences, every subsequence of length at most n is mapped into a pseudo-token: these grammatical tokens are expressed as {pj, ..., pj+∆}, with ∆ = 1, ..., n. In these patterns the representation of prepositions (POS tag IN) is made explicit: every position k ∈ [j, j+∆] for which posk = IN is represented through the word wk itself, so that at-NNP or of-DT-NN are obtained as pseudo-tokens for fragments such as "at Whitlock" or "of the vineyard". The representation of sentence (3) is shown in Table 1, where both the (wi.posi) pairs and the n-gram tokens are reported.

Table 1. Representation of lexical and grammatical information for sentence (3)

unigrams  cognac.NNP be.VBZ a.DT brandy.NN produce.VBN in.IN poitou-charentes.NNP
2-grams   NNP-VBZ VBZ-DT DT-NN NN-VBN VBN-in in-NNP NNP-.
3-grams   NNP-VBZ-DT VBZ-DT-NN DT-NN-VBN NN-VBN-in VBN-in-NNP in-NNP-.
4-grams   NNP-VBZ-DT-NN VBZ-DT-NN-VBN DT-NN-VBN-in NN-VBN-in-NNP VBN-in-NNP-.

2.1 Shallow Syntactic Features for Question Classification

In Question Answering three main processing stages are foreseen: question processing, document retrieval and answer extraction [16]. Question processing is usually centered around the so-called question classification (QC) task, which maps a question into one of k predefined answer classes [17]. Typical examples of classes characterize different answer strategies and range from questions regarding persons or organizations (e.g. Who killed JFK?) to definition questions (e.g. What is a perceptron?) or modalities (e.g. How fast does boiling water cool?). Highly accurate QC systems apply supervised machine learning techniques, e.g. Support Vector Machines (SVMs) [20, 23] or the SNoW model [17], where questions are encoded using a variety of lexical, syntactic and semantic features. In [17], it has been shown that the syntactic structure of questions contributes remarkably to the classification accuracy. This task is thus strictly syntax-dependent, especially because individual sentences are targeted. As questions can be regarded as individual sentences, we adopt the feature extraction scheme exemplified in Table 1 for our QC models. These features represent both lexical and grammatical information and can efficiently feed a statistical classifier based on linear kernels. Section 3.1 will discuss comparative experiments with previous works on Question Classification.
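Concretely, the extraction scheme of Section 2 can be sketched as follows (a minimal illustration; function and variable names are our own). With n = 4 it reproduces the features of Table 1 from the POS-tagged, lemmatized form of sentence (3):

```python
def extract_features(tagged, n=4):
    """tagged: list of (lemma, pos) pairs for one sentence.
    Returns word.pos unigrams plus POS n-grams (2 <= length <= n),
    with prepositions (POS tag IN) replaced by their surface form."""
    lemmas = [w.lower() for w, _ in tagged]
    tags = [p for _, p in tagged]
    feats = [f"{w}.{p}" for w, p in zip(lemmas, tags)]
    # Lexicalize prepositions inside the grammatical pseudo-tokens.
    toks = [lemmas[k] if tags[k] == "IN" else tags[k]
            for k in range(len(tags))]
    for size in range(2, n + 1):
        for j in range(len(toks) - size + 1):
            feats.append("-".join(toks[j:j + size]))
    return feats

tagged = [("cognac", "NNP"), ("be", "VBZ"), ("a", "DT"),
          ("brandy", "NN"), ("produce", "VBN"), ("in", "IN"),
          ("poitou-charentes", "NNP"), (".", ".")]
print(extract_features(tagged))
# -> ['cognac.NNP', ..., 'NNP-VBZ', ..., 'VBN-in-NNP', ...]
```

Note that, unlike Table 1, this sketch also emits a unigram for the sentence-final period.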
2.2 Shallow Syntactic Features for Sentiment Analysis over micro-blogging

Microblogging has already been established as a significant form of electronic word-of-mouth for sharing opinions, suggestions and consumer reviews concerning ideas, products or brands. Microblogging is also referred to as micro-sharing or Twittering (from Twitter, http://www.twitter.com, by far the most popular microblogging application). While opinion mining over traditional text sources (e.g. movie reviews or forums) has been studied extensively [22], sentiment analysis over tweets has a more recent history (see [10] or [2]). It has usually been addressed on the basis of lexical information only, whereas the syntactic structure of tweets is often neglected [22]. In [25] the linguistic redundancy in Twitter is investigated and several types of linguistic features are tested in a supervised setting, showing that the syntactic structure of tweets alone does not provide a statistically significant contribution with respect to lexically typed features. The main problem of syntax-driven approaches over tweets is the quality of the available grammatical information, as tweets often lack a proper grammatical structure. Here, modeling through POS n-grams is suitable to overcome these problems: it provides a simpler representation of the tweets' syntax and, at the same time, it should be more robust with respect to tagging accuracy. However, even POS taggers trained over standard texts may be inadequate, as the linguistic form of tweets is rather non-standard, with a large use of jargon and shortcuts. An interesting finding in [7] was that one of the main causes of syntactic parsing errors over the Twitter dataset is the propagation of part-of-speech tagging errors. In line with other works (see for example [10] or [15]), we propose to pre-process tweets before a standard POS tagger is applied. This avoids the noise caused by applying traditional POS tagging to odd symbols (e.g. re-tweets or emoticons) or jargon expressions, and also reduces data sparseness, as canonical forms are adopted. The following set of actions is applied before training:

– fully capitalized words are first converted into their lowercase counterparts, e.g. "DOG" into "dog", before applying POS tagging
– reply marks (i.e. @user_name) are replaced with the pseudo-token USER, whose POS tag is set to PUSER after POS tagging
– hyperlinks are replaced by the token LINK, whose POS is PLINK
– hash tags (i.e. #thread_name) are replaced by the pseudo-token THREAD, whose POS is set to PTHREAD
– repeated letters and punctuation characters (e.g. looove, loooove or !!!) are normalized, as they cause high levels of lexical data sparseness: characters occurring more than twice are replaced with a double occurrence, so that looove or !!! are mapped into loove or !!, respectively
– all emoticons, e.g. :-) or :P, are used as sentence separators, although they are systematically misinterpreted by a standard POS tagger. Accordingly, they are first replaced with a full stop "." and then restored to their original form after POS tagging. Their POS is always set to SMILE.

After the above pre-processing phase, a tweet like "@jdoe I looove Twitter! :-) http://twitpic.com/2y2e0" can be represented according to the model proposed in Section 2. The resulting lexical unigrams and grammatical n-grams include:

unigrams  USER.PUSER i.PRP loove.VBP twitter.NNP !.PUNC :-).SMILE LINK.PLINK
2-grams   PUSER-PRP PRP-VBP VBP-NNP NNP-PUNC PUNC-SMILE SMILE-PLINK
3-grams   PUSER-PRP-VBP ...
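A minimal sketch of this normalization step follows (our own rendering; the token names USER, LINK and THREAD follow the list above, while the regular expressions and the emoticon inventory are illustrative assumptions):

```python
import re

# Emoticons handled by the pre-processor; extend as needed (assumption).
EMOTICONS = [":-)", ":)", ":P", ":-(", ":("]

def normalize_tweet(text):
    """Apply the pre-processing actions of Section 2.2 before POS tagging."""
    text = re.sub(r"\b[A-Z]{2,}\b", lambda m: m.group().lower(), text)  # DOG -> dog
    text = re.sub(r"@\w+", "USER", text)          # reply marks
    text = re.sub(r"https?://\S+", "LINK", text)  # hyperlinks
    text = re.sub(r"#\w+", "THREAD", text)        # hash tags
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)    # looove -> loove, !!! -> !!
    for emo in EMOTICONS:                         # emoticons as sentence separators,
        text = text.replace(emo, ".")             # restored (as SMILE) after tagging
    return text

print(normalize_tweet("@jdoe I looove Twitter! :-) http://twitpic.com/2y2e0"))
# -> 'USER I loove Twitter! . LINK'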
As is clear from the example, the resulting POS sequences better capture the intended syntax and act as good models of relevant grammatical relations: the sequence USER.PUSER i.PRP loove.VBP ..., for example, is a good hint for the positive bias introduced by loove as a verb.

3 Performance Evaluation

In this section we evaluate the use of POS n-grams in the two applications previously discussed, as standard examples of semantic inferences useful for IR. In all the experiments, POS tagging is carried out by the tagger available in the LTH parser [13]. The performance achievable with POS n-grams is then compared with that derived from richer grammatical representations based on parse trees.

3.1 Question Classification Results

This first experiment studies the impact of combining lexical and shallow syntactic information (i.e. POS n-grams) on question classification. The targeted dataset is the UIUC corpus, largely adopted for benchmarking [17]. UIUC contains a training set of 5,452 questions and a test set of 500 questions, both extracted from TREC. Question classes are organized in two levels of granularity. At the first level, 6 coarse-grained classes are defined, such as ABBREVIATION, ENTITY and DESCRIPTION. A second level expands the first-level classes into a set of 50 fine-grained sub-classes, e.g., Plant and Food are subclasses of the ENTITY category.

SVM learning is applied over the feature vectors discussed in Section 2.1, and multi-class classification is modeled through a one-vs-all scheme. The quality of classification is measured through accuracy, i.e. the percentage of questions associated with the correct class. A development set is derived from 20% of the training material. In the experiments two sentence models are compared:

– POS-tagged Unigrams (PU): a question is mapped into a bag of POS-tagged lemmas, i.e. into pairs (lemma.pos). This model is based only on lexical information.
– POS n-grams (PnG): each question is modeled by augmenting the PU model with the shallow syntactic information provided by the sequences of POS tag n-grams, with n < 4. The POS tags of wh-determiners and prepositions are replaced in the individual POS n-grams by the corresponding lemmas.

In this evaluation both the voted perceptron [8] and SVMlight [11] have been applied; with SVMlight, a polynomial kernel of degree 2 was used, as it achieved the best result on the development set.
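Training over these representations only requires a sparse linear learner. As a hedged illustration (we use scikit-learn here as a stand-in for the SVMlight and voted perceptron implementations actually adopted; the documents and labels below are toy assumptions), a one-vs-all linear SVM over PnG-style features looks as follows:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy examples: each question is the space-joined output of extract_features()
# (see Section 2); labels are UIUC coarse classes.
train = [
    "who.WP be.VBZ kennedy.NNP WP-VBZ VBZ-NNP WP-VBZ-NNP",
    "what.WP be.VBZ a.DT perceptron.NN WP-VBZ VBZ-DT DT-NN WP-VBZ-DT",
]
labels = ["HUMAN", "DESCRIPTION"]

# Sparse bag-of-features vectors; splitting on whitespace keeps the
# pseudo-tokens intact.
vec = CountVectorizer(analyzer=str.split, binary=True)
X = vec.fit_transform(train)

# LinearSVC trains one-vs-rest linear SVMs, mirroring the one-vs-all scheme.
clf = LinearSVC().fit(X, labels)
test = vec.transform(["who.WP kill.VBD jfk.NNP WP-VBD VBD-NNP"])
print(clf.predict(test))  # -> ['HUMAN'] on this toy setup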
Results are shown in Table 2, compared with those achieved by the system discussed in [23] on the same UIUC dataset. Those authors combine a kernel classifier based on BOW with two semantic kernels: one (K(LS)) is based on Latent Semantic Indexing applied to Wikipedia, and the other (K(semRel)) uses semantic information acquired through manually constructed lists of words, i.e., a task-specific lexicon related to the answer types.

Table 2. Accuracy measures for the QC task

Kernel                               Coarse Task   Fine-grain Task
PU (VotedPerc)                       89.2%         81.4%
PU (SVM)                             89.4%         83.8%
PnG (VotedPerc)                      91.4%         84.0%
PnG (SVM)                            91.8%         84.8%
[23] K(BOW)                          86.4%         80.8%
[23] K(LS)                           70.4%         71.2%
[23] K(BOW)+K(LS)                    90.0%         83.2%
[23] K(BOW)+K(LS)+K(semRel)          90.8%         85.6%
[20] Tree Kernels: K(BOW)+K(PartialTrees)  91.8%   -

In the coarse-grained test, i.e. question classification with respect to the 6 coarse-grained classes, Table 2 shows how the syntactic generalization supported by the PnG model achieves the best known result on the UIUC dataset, i.e. 91.8%, which corresponds to the accuracy reported by a tree kernel approach [20], without any semantic extension. This improves over the best result of [23] (i.e., K(BOW)+K(LS)+K(semRel)), which relies on a task-dependent use of manually annotated resources. Note how the kernel K(LS), which uses only lexical information gathered from an external corpus such as Wikipedia [23], is also weaker than the PnG model, which makes no use of trees or other resources.

The results in Table 2 are also remarkable from a computational point of view: the PnG method only requires POS-tagged sentences and no parsing. Moreover, the training time of tree kernel based SVMs on benchmarking data sets is in the order of hours or days for large training collections (e.g., PropBank, as reported in [18]). In [6] an extension to the tree kernel formulation has been proposed, the semantically Smoothed Partial Tree Kernel, which enriches the similarity among syntactic tree structures with lexical information gathered from an external corpus, in line with the K(LS) kernel described in [23]. State-of-the-art results of 94.8% have been obtained with it in the coarse-grained test. However, it is still a complex approach that needs explicit syntactic parsing of the sentences and an external corpus providing lexical knowledge; this is beyond the scope of this work, which aims at providing an efficient and practical engineering method for natural language learning systems.

The training complexity of the proposed models is very low. Consider that for a short sentence (i.e. a question or a micro-blogging message) the number of features is small: for example, a sentence of 10 words generates 10 lexical, 9 bi-gram, 8 tri-gram and 7 four-gram features, i.e. a feature vector of 34 features. This yields a high-dimensional but very sparse space, where both the SVM and the voted perceptron algorithms can find a solution very effectively. The efficiency of the proposed method in the QC task is thus proved, as the PnG model has been trained over the 5,452 examples in less than 2 minutes and in less than 40 seconds with SVMlight and the voted perceptron, respectively.

3.2 Sentiment Analysis Results

The POS n-grams model has also been applied to the task of Sentiment Analysis over tweets, as introduced in Section 2.2. The goal here is to classify a tweet according to its sentiment polarity. The adopted dataset is Twitter Sentiment, released by [10] (http://www.stanford.edu/~alecmgo/cs224n/twitterdata.2009.05.25.c.zip), as other studies (e.g. [2]) do not allow a full comparative analysis. It provides a training set automatically generated by selecting the positive (or negative) examples from the tweets containing positive (or negative) emoticons, e.g. :-) (or :-( ). The test set, also made available by [10], includes 183 tweets, manually annotated according to their binary sentiment polarity, i.e. ±1. Each tweet is modeled as a feature vector including words as well as the pseudo-tokens generated in the pre-processing phase, together with the resulting POS n-grams (see Section 2.2). SVMlight has been applied, with a 50%-50% train-development split; in this setting, a linear kernel provided the best results.

Table 3. Experimental results for the Sentiment Analysis task

Unigrams                            77.60%
POS tagged Unigrams                 82.51%
Noisy POS 4-grams (no pre-proc.)    77.59%
POS 4-grams                         83.61%
Unigrams [10]                       82.20%
POS tagged Unigrams [10]            83.00%
As Table 3 suggests, the results improve on [10], as the adopted grammatical information is helpful. The test set employed in our experiments appears slightly harder, as the Unigrams model achieves a significantly worse result than reported in [10]. Moreover, without pre-processing, POS tags are inaccurate, and this is reflected in the lower performance of the Noisy POS 4-grams model. Our approach achieves a new state of the art (83.61%) on this dataset. This result is due to the grammatical information provided by the POS n-grams, and the contribution of the proposed pre-processing method is crucial: when no pre-processing is applied, the noise introduced by the POS tagger produces a consistent performance reduction, i.e. 77.59% vs. 82.51%. Error analysis suggests that the remaining mistakes (e.g. the positive polarity given to the tweet "Kobe is the best in the world not Lebron") are due to a lack of information: if LeBron James (and not Kobe) is the focus, then the polarity is negative, but the alternative decision would otherwise have been perfectly acceptable. Figure 2 reports the learning curve for the system with and without POS n-grams: POS n-grams are responsible for a faster convergence to higher accuracy levels.

Fig. 2. Twitter Sentiment Analysis: accuracy (learning curves with and without POS n-grams)

4 Conclusions

In this paper, shallow grammatical features consisting of sequences of POS tags (i.e. POS n-grams) are proposed as a robust and effective model of grammatical information in different semantic tasks. Every experiment shows that state-of-the-art results are achieved or closely approximated by our modeling. Although standard training algorithms are adopted, simple kernels over POS n-grams are quite effective, as for example the sentiment analysis tests demonstrate. Surprisingly, in Question Classification our model equals the accuracy of a high-performing tree kernel. The training complexity of the proposed models is very low. Although several optimization methods for tree kernel learning have been proposed (e.g. [6, 18]), our simpler approach is more widely applicable, as it poses much weaker requirements on the quality and size of the annotated datasets. This makes the proposed technology quite appealing for complex NLP and IR applications, such as the treatment of the noisy sources that current micro-blogging trends require. This is also shown by the performance observed in the tweet sentiment analysis task, for which state-of-the-art results are obtained.

References

1. Baeza-Yates, R.A., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (1999)
2. Barbosa, L., Feng, J.: Robust sentiment detection on Twitter from biased and noisy data. In: Coling 2010: Posters. pp. 36–44. Coling 2010 Organizing Committee, Beijing, China (August 2010)
3. Bilotti, M.W., Elsas, J.L., Carbonell, J., Nyberg, E.: Rank learning for factoid question answering with linguistic and semantic constraints. In: Proceedings of ACM CIKM (2010)
4. Collins, M.: Three generative, lexicalised models for statistical parsing. In: Proceedings of ACL 1997. pp. 16–23 (1997)
5. Collins, M., Duffy, N.: Convolution kernels for natural language. In: Proceedings of Neural Information Processing Systems (NIPS). pp. 625–632 (2001)
6. Croce, D., Moschitti, A., Basili, R.: Structured lexical similarity via convolution kernels on dependency trees. In: Proceedings of EMNLP. Edinburgh, Scotland, UK (2011)
7. Foster, J., Çetinoğlu, Ö., Wagner, J., Le Roux, J., Hogan, S., Nivre, J., Hogan, D., van Genabith, J.: #hardtoparse: POS tagging and parsing the Twitterverse. In: Proceedings of the AAAI-11 Workshop on Analysing Microtext. San Francisco, CA (August 2011)
8. Freund, Y., Schapire, R.E.: Large margin classification using the perceptron algorithm. Machine Learning 37(3), 277–296 (1999)
9. Gildea, D., Jurafsky, D.: Automatic labeling of semantic roles. Computational Linguistics 28(3), 245–288 (2002)
10. Go, A., Bhayani, R., Huang, L.: Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford (2009)
11. Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In: Proceedings of the European Conference on Machine Learning (1998)
12. Johansson, R., Moschitti, A.: Extracting opinion expressions and their polarities – exploration of pipelines and joint models. In: Proceedings of ACL-HLT. Portland, Oregon, USA (2011)
13. Johansson, R., Nugues, P.: Dependency-based syntactic-semantic analysis with PropBank and NomBank. In: Proceedings of CoNLL-2008. Manchester, UK (August 16-17, 2008)
14. Johansson, R., Nugues, P.: The effect of syntactic representation on semantic role labeling. In: Proceedings of COLING. Manchester, UK (August 18-22, 2008)
15. Kaufmann, J., Kalita, J.: Syntactic normalization of Twitter messages. In: International Conference on Natural Language Processing (2010)
16. Kwok, C.C.T., Etzioni, O., Weld, D.S.: Scaling question answering to the web. In: WWW. pp. 150–161 (2001)
17. Li, X., Roth, D.: Learning question classifiers. In: Proceedings of ACL '02 (2002)
18. Moschitti, A.: Efficient convolution kernels for dependency and constituent syntactic trees. In: Proceedings of ECML 2006, 17th European Conference on Machine Learning. pp. 318–329. Berlin, Germany (September 2006)
19. Moschitti, A., Pighin, D., Basili, R.: Tree kernels for semantic role labeling. Computational Linguistics 34 (2008)
20. Moschitti, A., Quarteroni, S., Basili, R., Manandhar, S.: Exploiting syntactic and shallow semantic kernels for question answer classification. In: Proceedings of ACL 2007. pp. 776–783 (2007)
21. Moschitti, A., Quarteroni, S., Basili, R., Manandhar, S.: Exploiting syntactic and shallow semantic kernels for question/answer classification. In: Proceedings of ACL 2007 (2007)
22. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2), 1–135 (January 2008)
23. Tomás, D., Giuliano, C.: A semi-supervised approach to question classification. In: Proceedings of the 17th European Symposium on Artificial Neural Networks. Bruges, Belgium (2009)
24. Voorhees, E.M., Harman, D.: Overview of the seventh Text REtrieval Conference (TREC-7). In: Proceedings of the Seventh Text REtrieval Conference (TREC-7). pp. 1–24 (1998)
25. Zanzotto, F.M., Pennacchiotti, M., Tsioutsiouliklis, K.: Linguistic redundancy in Twitter. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2011)