Nyström Methods for Efficient Kernel-Based Methods for Community Question Answering

Danilo Croce1, Simone Filice2 and Roberto Basili1
1 Dept. of Enterprise Engineering
2 Dept. of Civil Engineering and Computer Science Engineering
University of Roma, Tor Vergata, Italy
{croce,filice,basili}@info.uniroma2.it

Abstract

English. Expressive but complex kernel functions, such as Sequence or Tree kernels, are usually underemployed in NLP tasks, e.g., in community Question Answering (cQA), because of their significant complexity in both the learning and classification stages. Recently, the Nyström methodology for data embedding has been proposed as a viable solution to scalability problems. By mapping data into low-dimensional approximations of kernel spaces, it improves scalability through compact linear representations of highly structured data. In this paper, we show that the Nyström methodology can be effectively used to apply a kernel-based method to the cQA task, achieving state-of-the-art results while reducing the computational cost by orders of magnitude.

Italiano. Machine learning methods based on complex kernel functions, such as Sequence or Tree Kernels, risk being underemployed in natural language processing problems (such as Community Question Answering) because of the high computational costs of training and classification. Recently, a methodology based on the Nyström method has been proposed to cope with these scalability problems: it projects the examples observed during training and classification into low-dimensional spaces that approximate the space underlying the kernel function. These compact representations make it possible to apply extremely efficient and scalable machine learning algorithms. In this work we show that kernel methods can be applied to the Community Question Answering problem, obtaining state-of-the-art results while reducing computational costs by orders of magnitude.
1 Introduction

Kernel methods (Shawe-Taylor and Cristianini, 2004) have been employed in several Machine Learning algorithms (Crammer et al., 2006; Vapnik, 1998), achieving state-of-the-art performance in many classification tasks. Recently, the kernel-based approach presented in (Filice et al., 2016) has been applied to the community Question Answering (cQA) challenge at SemEval 2016 (Nakov et al., 2016), obtaining state-of-the-art results.

Unfortunately, when large data volumes are involved, the time and space complexity required by learning and classification may prevent the adoption of expressive but complex kernel functions, such as Sequence (Cancedda et al., 2003) or Tree kernels (Collins and Duffy, 2001). In particular, the classification cost of a kernel-based model crucially depends on its number of support vectors: classifying a new instance requires a kernel computation against all support vectors. This scalability issue is evident in many NLP and IR applications, such as re-ranking answers in question answering (Moschitti et al., 2007; Severyn et al., 2013; Filice et al., 2016), where the number of support vectors is typically very large.

Some approaches have been defined to bound the complexity of kernel-based methods, such as (Wang and Vucetic, 2010; Vedaldi and Zisserman, 2012; Filice et al., 2014), but they are still specific to particular kernel formulations and learning algorithms. In (Croce and Basili, 2016) it has been shown that a viable and more general solution to the above scalability issues is the Nyström methodology, a dimensionality reduction technique that has been applied within kernel-based methods since (Williams and Seeger, 2001). This methodology approximates the Gram matrix derivable from a kernel function, enabling the projection of examples into low-dimensional spaces. The Nyström projection function is generated from a set of examples called landmarks, whose number directly impacts the quality of the embeddings; dually, the cost of projecting a new example into the embedding space rises linearly with the number of landmarks, which is usually orders of magnitude lower than the number of support vectors derived from a learning process. Once each example is projected into the dense low-dimensional space, efficient linear learning methods can be applied, such as (Hsieh et al., 2008), preserving at the same time the expressiveness and effectiveness of kernel methods. This approach is applicable to different input data as well as to different kernels or learning algorithms, as discussed in (Croce and Basili, 2016).

In this paper we show that the Nyström method can be effectively used in the cQA task: by adopting the same kernel functions proposed in (Filice et al., 2016), we obtain the same results w.r.t. the metrics adopted in the SemEval task, while reducing the computational cost by orders of magnitude.

In Section 2, we demonstrate the viability of the Nyström method for reducing the computational costs of kernel machines. Experimental results, obtained by applying efficient SVM learning to the cQA task, are discussed in Section 3. Finally, Section 4 describes related work, while Section 5 derives the conclusions.
2 Linearizing linguistic properties through the Nyström approach

Given an input training dataset o_i ∈ D, a kernel function K(o_i, o_j) is a similarity function that corresponds to a dot product in the implicit kernel space, i.e., K(o_i, o_j) = Φ(o_i) · Φ(o_j). The advantage of kernels is that the projection function Φ(o_i) = x_i ∈ R^n is never explicitly computed (Shawe-Taylor and Cristianini, 2004). In fact, this operation may be prohibitive when the dimensionality n of the underlying kernel space is extremely large. For example, Tree Kernels (Collins and Duffy, 2001) give rise to spaces whose number of dimensions is proportional to the number of possible sub-trees in a natural language. Kernel functions are exploited by kernel-based learning algorithms, such as SVM (Vapnik, 1998), to operate on the implicit kernel space without its explicit definition.

Let us assume that, given a kernel K, its explicit projection function Φ over D is available to derive new representations x_i, which are the rows of the resulting matrix X. We define the Gram matrix as G = XX^⊤, with each element corresponding to G_ij = Φ(o_i)Φ(o_j) = K(o_i, o_j). The aim of the Nyström method is to derive a new low-dimensional embedding in an l-dimensional space, with l ≪ n, so that G ≈ G̃ = X̃X̃^⊤. This is obtained by generating an approximation of G using a subset of l columns of the matrix. This corresponds to selecting a subset L of the available examples, called landmarks. Suppose we randomly sample l columns of G, and let C be the n × l matrix of these sampled columns. Then, we can rearrange the columns and rows of G and define X = [X_1 X_2] such that:

$$G = XX^\top = \begin{bmatrix} W & X_1^\top X_2 \\ X_2^\top X_1 & X_2^\top X_2 \end{bmatrix} \quad \text{and} \quad C = \begin{bmatrix} W \\ X_2^\top X_1 \end{bmatrix} \qquad (1)$$

where W = X_1^⊤ X_1, i.e., the subset of G that only considers landmarks. The Nyström approximation can be defined as:

$$G \approx \tilde{G} = C W^{\dagger} C^\top \qquad (2)$$

where W^† denotes the Moore-Penrose pseudo-inverse of W. The Singular Value Decomposition (SVD) is used to obtain W^† as follows. First, W is decomposed as W = USV^⊤, where U and V are both orthogonal matrices and S is a diagonal matrix containing the (non-zero) singular values of W on its diagonal. Since W is symmetric and positive definite, W = USU^⊤. Then W^† = US^{-1}U^⊤ = US^{-1/2}S^{-1/2}U^⊤, and Equation 2 can be rewritten as

$$G \approx \tilde{G} = C U S^{-\frac{1}{2}} S^{-\frac{1}{2}} U^\top C^\top = (C U S^{-\frac{1}{2}})(C U S^{-\frac{1}{2}})^\top = \tilde{X}\tilde{X}^\top \qquad (3)$$

Given an input example o_i ∈ D, a new low-dimensional representation x̃_i can thus be determined by considering the corresponding i-th row of C:

$$\tilde{x}_i = \Theta(o_i) = \vec{c}_i\, U S^{-\frac{1}{2}} \qquad (4)$$

where c_i is the vector whose dimensions contain the evaluations of the kernel function between o_i and each landmark o_j ∈ L. The method produces l-dimensional vectors, and no restriction is applied to the input dataset as long as a valid K(o_i, o_j) is used.

Several policies have been defined to select the landmarks so as to reduce the Gram matrix approximation error. In this work, uniform sampling without replacement is adopted, as suggested by (Kumar et al., 2012), where this policy has been theoretically and empirically shown to achieve results comparable with other (more complex) selection policies.

Assuming that k is the computational cost of a single kernel operation (expressed in terms of basic operations, such as products), the runtime cost of the Nyström method is O(knl + l^3 + nl^2), as it depends on (i) the computation of the n × l matrix C, i.e., O(knl); (ii) the SVD of W, which is O(l^3); and (iii) the projection of the entire dataset through the multiplication by C, i.e., O(nl^2). For several classes of kernels, such as Tree or Sequence Kernels (Collins and Duffy, 2001), the kernel computation cost is extremely high. Therefore, the computational cost of constructing the matrix C dominates the overall expression.

Once an example is projected into the l-dimensional space, efficient and large-scale learning algorithms can be applied. To further control the computational cost of the training step, we adopted a class of algorithms that bounds the number of times a single instance is re-used during training. In particular, we investigated the Dual Coordinate Descent algorithm (Hsieh et al., 2008): it is a batch learning algorithm whose achievable accuracy depends on the number of iterations T over the training dataset. Its training time cost on a dataset of n examples in R^l is O(Tnl). Since the number of iterations required to obtain an accurate model is fixed (in (Croce and Basili, 2016) a number of iterations T = 30 obtained stable and accurate results in several tasks), such cost is negligible w.r.t. the projection cost. Therefore, a complete training process exploiting the Nyström method costs simply O(knl), which should be compared with a traditional kernel-based SVM learning algorithm, e.g., (Chang and Lin, 2011), whose computational cost is about O(kn^2), with l ≪ n.

The computational cost of a classification step only depends on the projection of the example into the new space, i.e., O(kl). In fact, once a test example is projected, the final decision requires a dot product between the low-dimensional representation x̃_i and the hyperplane underlying the classification function: again, this is negligible with respect to the cost of the kernel operations. Such cost is typically much lower than the cost of a pure kernel-based classification, which requires a kernel operation against all the support vectors selected during the training process, whose number is usually far larger than the number of landmarks.
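To make Equations 1-4 concrete, the following sketch shows one possible NumPy implementation of the Nyström projection. It is an illustrative example, not the KeLP code used in the experiments; the Gaussian kernel and the synthetic vectors are stand-ins for the structural kernels and linguistic objects discussed above.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.1):
    # Stand-in for any valid kernel K(o_i, o_j), e.g., a Tree or
    # Sequence Kernel computed over parsed texts.
    return np.exp(-gamma * np.sum((a - b) ** 2))

def nystrom_projection(data, landmarks, kernel):
    """Builds the projection of Equation 4.

    Returns the embedded dataset X_tilde and the matrix P = U S^{-1/2},
    so that a new example o is embedded as k(o, landmarks) @ P.
    """
    # W: kernel evaluations among landmarks (the l x l block of Equation 1)
    W = np.array([[kernel(li, lj) for lj in landmarks] for li in landmarks])
    # SVD of the symmetric, positive semi-definite matrix W = U S U^T
    U, S, _ = np.linalg.svd(W)
    # Discard numerically zero singular values for a stable pseudo-inverse
    keep = S > 1e-12
    P = U[:, keep] / np.sqrt(S[keep])          # U S^{-1/2} (Equation 3)
    # C: kernel evaluations between every example and the landmarks
    C = np.array([[kernel(x, lj) for lj in landmarks] for x in data])
    return C @ P, P

# Toy usage: 200 synthetic examples, l = 20 landmarks sampled uniformly
# without replacement, as suggested by (Kumar et al., 2012).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))
landmarks = data[rng.choice(len(data), size=20, replace=False)]
X_tilde, P = nystrom_projection(data, landmarks, rbf_kernel)
print(X_tilde.shape)   # (200, 20): l-dimensional embeddings
```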
3 Empirical Investigation: the Community QA task

The proposed Nyström method has been applied to the SemEval-2016 community Question Answering (cQA) task, where participants are asked to automatically provide good answers in a community question answering setting (Nakov et al., 2016). In particular, we focused on Subtask A: given a question and a large collection of question-comment threads created by a user community, the task consists in (re-)ranking the comments that are most useful for answering the question. This task is interesting because kernel methods achieved the highest results, as demonstrated by the KeLP team (Filice et al., 2016). In particular, Subtask A is modeled as a binary classification problem, where examples are generated by considering (question, comment) pairs. Each pair generates an example for a binary SVM, where the positive label is associated with good comments and the negative label covers the potential and bad comments. The classification score is used to sort the instances and produce the final ranking. According to the above setting, a training and a test dataset made of 20,340 and 3,270 examples, respectively, are generated. In (Filice et al., 2016), a kernel-based SVM classifier achieved state-of-the-art results by adopting a kernel combination that exploited (i) feature vectors containing linguistic similarities between the texts in a pair; (ii) shallow syntactic trees that encode the lexical and morpho-syntactic information shared between text pairs; (iii) feature vectors capturing task-specific information.

First, a batch kernel-based SVM learning algorithm (Chang and Lin, 2011) operating on the kernel function proposed in (Filice et al., 2016) is adopted to determine the upper bound in terms of classification quality (but with higher computational costs). Then, multiple standard Nyström methods are used to linearize the dataset by sampling different numbers of landmarks: 10 configurations have been investigated, starting from 100 landmarks and incrementally adding 100 landmarks at a time. The higher the number of landmarks, the higher the quality of the approximated low-dimensional space (Drineas and Mahoney, 2005), but also the higher the computational cost. The most complex projection function is thus based on 1,000 landmarks. Landmarks have been selected by random sampling without replacement, as suggested in (Kumar et al., 2012). An efficient linear SVM (Hsieh et al., 2008) is adopted on the resulting embedding space. Experiments have been carried out using the KeLP framework (Filice et al., 2015a), available at https://github.com/SAG-KeLP.
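For illustration only, the sketch below assembles the full pipeline with off-the-shelf components: scikit-learn's Nystroem transformer for the embedding and LinearSVC (a liblinear-based linear SVM, in the spirit of the Dual Coordinate Descent learner of (Hsieh et al., 2008)) as the efficient linear classifier. The RBF kernel and synthetic data again stand in for the task-specific kernels and the (question, comment) pairs.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC  # liblinear: dual coordinate descent solver

# Synthetic binary data standing in for (question, comment) pair representations;
# in the real task the kernel is a combination of tree/sequence kernels
# rather than the RBF kernel used here as a stand-in.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(1000, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(200, 5))

# 1) Nystrom embedding with l = 100 landmarks sampled from the training data.
nys = Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0)
train_emb = nys.fit_transform(X_train)

# 2) Linear SVM on the l-dimensional embeddings: training is O(T * n * l).
clf = LinearSVC(C=1.0, max_iter=10000).fit(train_emb, y_train)

# 3) Classification: l kernel evaluations per test example plus one dot product;
#    in the cQA task the scores would be used to rank the comments of a question.
scores = clf.decision_function(nys.transform(X_test))
print(scores[:5])
```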
Table 1: Results in cQA. The upper bound is achieved by a standard SVM requiring more than 37M kernel operations.

  Landmarks      MAP    F1     Saving
  100            76.0   58.6   99.1%
  200            77.0   60.8   98.2%
  300            77.5   62.2   97.4%
  400            77.7   62.4   96.5%
  500            77.9   63.1   95.6%
  600            78.0   63.6   94.7%
  700            78.1   63.7   93.8%
  800            78.0   63.8   92.9%
  900            78.1   64.2   92.1%
  1000           78.2   64.4   91.2%
  standard SVM   79.2   64.4   -
  ConvKN         77.7   66.2   -
  SemanticZ      77.6   61.8   -

Results are reported in Table 1 in terms of Mean Average Precision (MAP, the official ranking metric of the competition), F1 on the good class, and computational saving, i.e., the percentage of avoided kernel operations at classification time. The standard SVM model contains 11,322 support vectors, thus requiring more than 37M kernel operations for the complete classification of the 3,270 test instances (the classification time was more than three and a half hours on a standard machine with a 4-core i7-2600 at 3.4 GHz). By adopting the Nyström methodology with only 1,000 landmarks, the same F1 score (i.e., 64.4) is obtained. Moreover, a MAP (i.e., 78.2%) comparable with the one achieved by the KeLP team is replicated with a saving of 91.2%. The speed-up is impressive also when fewer landmarks are used: with 300 landmarks, a MAP of 77.7 is obtained while saving more than 97% of the kernel computations. These results are remarkable, considering that results comparable with the state of the art can be obtained while reducing the computational costs by almost two orders of magnitude. Overall, the MAP obtained by the proposed approach is still higher than the one achieved by all the other systems in the challenge, including ConvKN (Barrón-Cedeño et al., 2016) and SemanticZ (Mihaylov and Nakov, 2016), i.e., the second and third best systems, respectively.
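As a sanity check, the saving column of Table 1 can be recovered directly from the counts above; for the 1,000-landmark configuration:

$$1 - \frac{1{,}000 \times 3{,}270}{11{,}322 \times 3{,}270} = 1 - \frac{1{,}000}{11{,}322} \approx 0.912$$

i.e., roughly 3.3M kernel evaluations against the landmarks instead of the 37M required by the 11,322 support vectors.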
4 Related Work

Improving the efficiency of kernel-based methods is a largely studied topic. The reduction of computational costs has long been pursued by imposing a budget on the number of support vectors (Cesa-Bianchi and Gentile, 2006; Dekel and Singer, 2006; Orabona et al., 2008; Wang and Vucetic, 2010; Filice et al., 2014). However, in complicated tasks such methods still require large budgets and thus still rely on many kernel computations. They are therefore less efficient than Nyström: a classifier based on the Nyström method with l landmarks has approximately the same computational complexity as its budgeted counterpart with a budget set to l, but its accuracy is typically higher, as shown in (Croce and Basili, 2016). Alternatively, Zanzotto and Dell'Arciprete (2012) proposed Distributed Tree Kernels (DTKs), which approximate tree kernels (Collins and Duffy, 2001) through an explicit mapping of trees into vectors. DTKs focus on specific tree kernel functions, while the approach proposed here can be effectively applied to any kernel function. An alternative strategy is presented in (Filice et al., 2015b), where a cascade of kernel-based classifiers is organized according to the computational cost of their kernel functions, so that more complex classifiers are invoked only on difficult instances. Their solution is strictly tied to the availability of multiple kernels that have to be sorted according to their complexity and expressiveness. Usually, it is hard to define many kernels for a given task, and consequently only few layers can be set.

5 Conclusion

This paper discussed the application of the Nyström method for a significant reduction of the computational costs of kernel-based classification in the cQA task. By projecting examples into low-dimensional embeddings, Nyström enables the adoption of efficient linear classifiers and drastically reduces the overall computational cost. Experimental results demonstrate that the proposed approach leads to a cost reduction higher than 90%, with a negligible performance drop. Future research will be devoted to the definition of a principled strategy to estimate the optimal number of layers, as well as the size of the embeddings at each layer.

References
Alberto Barrón-Cedeño, Giovanni Da San Martino, Shafiq Joty, Alessandro Moschitti, Fahad Al-Obaidli, Salvatore Romeo, Kateryna Tymoshenko, and Antonio Uva. 2016. ConvKN at SemEval-2016 Task 3: Answer and question selection for question answering on Arabic and English fora. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 896-903, San Diego, California, June. Association for Computational Linguistics.

Nicola Cancedda, Éric Gaussier, Cyril Goutte, and Jean-Michel Renders. 2003. Word-sequence kernels. Journal of Machine Learning Research, 3:1059-1082.

Nicolò Cesa-Bianchi and Claudio Gentile. 2006. Tracking the best hyperplane with a simple budget perceptron. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, pages 483-498. Springer-Verlag.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1-27:27, May.

Michael Collins and Nigel Duffy. 2001. Convolution kernels for natural language. In Proceedings of Neural Information Processing Systems (NIPS 2001), pages 625-632.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551-585, December.

Danilo Croce and Roberto Basili. 2016. Large-scale kernel-based language learning through the ensemble Nyström methods. In Advances in Information Retrieval - 38th European Conference on IR Research, ECIR 2016, Padua, Italy, March 20-23, 2016, Proceedings, pages 100-112.

Ofer Dekel and Yoram Singer. 2006. Support vector machines on a budget. In Bernhard Schölkopf, John Platt, and Thomas Hoffman, editors, NIPS, pages 345-352. MIT Press.

Petros Drineas and Michael W. Mahoney. 2005. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6.

Simone Filice, Giuseppe Castellucci, Danilo Croce, and Roberto Basili. 2014. Effective kernelized online learning in language processing tasks. In Proceedings of ECIR 2014, pages 347-358.

Simone Filice, Giuseppe Castellucci, Danilo Croce, and Roberto Basili. 2015a. KeLP: a kernel-based learning platform for natural language processing. In Proceedings of ACL: System Demonstrations, Beijing, China, July.

Simone Filice, Danilo Croce, and Roberto Basili. 2015b. A stratified strategy for efficient kernel-based learning. In AAAI Conference on Artificial Intelligence.

Simone Filice, Danilo Croce, Alessandro Moschitti, and Roberto Basili. 2016. KeLP at SemEval-2016 Task 3: Learning semantic relations between questions and answers. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1116-1123, San Diego, California, June. Association for Computational Linguistics.

Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. 2008. A dual coordinate descent method for large-scale linear SVM. In Proceedings of ICML 2008, pages 408-415, New York, NY, USA. ACM.

Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. 2012. Sampling methods for the Nyström method. Journal of Machine Learning Research, 13:981-1006, April.

Todor Mihaylov and Preslav Nakov. 2016. SemanticZ at SemEval-2016 Task 3: Ranking relevant answers in community question answering using semantic similarity based on fine-tuned word embeddings. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 879-886, San Diego, California, June. Association for Computational Linguistics.

Alessandro Moschitti, Silvia Quarteroni, Roberto Basili, and Suresh Manandhar. 2007. Exploiting syntactic and shallow semantic kernels for question/answer classification. In Proceedings of ACL'07.

Preslav Nakov, Lluís Màrquez, Alessandro Moschitti, Walid Magdy, Hamdy Mubarak, Abed Alhakim Freihat, Jim Glass, and Bilal Randeree. 2016. SemEval-2016 Task 3: Community question answering. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval '16, San Diego, California, June. Association for Computational Linguistics.

Francesco Orabona, Joseph Keshet, and Barbara Caputo. 2008. The Projectron: a bounded kernel-based perceptron. In Proceedings of ICML '08, pages 720-727, USA. ACM.

Aliaksei Severyn, Massimo Nicosia, and Alessandro Moschitti. 2013. Building structures from classifiers for passage reranking. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM '13, pages 969-978, New York, NY, USA. ACM.

John Shawe-Taylor and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA.

Vladimir N. Vapnik. 1998. Statistical Learning Theory. Wiley-Interscience.

Andrea Vedaldi and Andrew Zisserman. 2012. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3).

Zhuang Wang and Slobodan Vucetic. 2010. Online passive-aggressive algorithms on a budget. Journal of Machine Learning Research - Proceedings Track, 9:908-915.

Christopher K. I. Williams and Matthias Seeger. 2001. Using the Nyström method to speed up kernel machines. In Proceedings of NIPS 2000.

Fabio Massimo Zanzotto and Lorenzo Dell'Arciprete. 2012. Distributed tree kernels. In Proceedings of ICML 2012.