Nyström Methods for Efficient Kernel-Based Methods for Community Question Answering

Danilo Croce1, Simone Filice2 and Roberto Basili1
1 Dept. of Enterprise Engineering
2 Dept. of Civil Engineering and Computer Science Engineering
University of Roma, Tor Vergata, Italy
{croce,filice,basili}@info.uniroma2.it

Abstract

English. Expressive but complex kernel functions, such as Sequence or Tree kernels, are usually underemployed in NLP tasks, e.g., in community Question Answering (cQA), because of their significant complexity in both the learning and classification stages. Recently, the Nyström methodology for data embedding has been proposed as a viable solution to scalability problems. By mapping data into low-dimensional approximations of kernel spaces, it improves scalability through compact linear representations of highly structured data. In this paper, we show that the Nyström methodology can be effectively used to apply a kernel-based method to the cQA task, achieving state-of-the-art results while reducing the computational cost by orders of magnitude.

Italiano. Machine learning methods based on complex kernel functions, such as Sequence or Tree Kernels, risk being underemployed in natural language processing problems (such as Community Question Answering) because of the high computational costs of training and classification. Recently, a methodology based on the Nyström method has been proposed to cope with these scalability problems: it projects the examples observed during training and classification into low-dimensional spaces that approximate the space underlying the kernel function. These compact representations make it possible to apply extremely efficient and scalable machine learning algorithms. In this work we show that kernel methods can be applied to the Community Question Answering problem, obtaining state-of-the-art results while reducing computational costs by orders of magnitude.
1 Introduction

Kernel methods (Shawe-Taylor and Cristianini, 2004) have been employed in several Machine Learning algorithms (Crammer et al., 2006; Vapnik, 1998), achieving state-of-the-art performance in many classification tasks. Recently, the kernel-based approach presented in (Filice et al., 2016) has been applied to the community Question Answering (cQA) challenge at SemEval 2016 (Nakov et al., 2016), obtaining state-of-the-art results.

Unfortunately, when large data volumes are involved, the time and space complexity required by learning and classification may prevent the adoption of expressive but complex kernel functions, such as Sequence (Cancedda et al., 2003) or Tree kernels (Collins and Duffy, 2001). In particular, the classification cost of a kernel-based model crucially depends on its number of support vectors: classifying a new instance requires a kernel computation against all support vectors. This scalability issue is evident in many NLP and IR applications, such as re-ranking answers in question answering (Moschitti et al., 2007; Severyn et al., 2013; Filice et al., 2016), where the number of support vectors is typically very large.

Some approaches have been defined to bound the complexity of kernel-based methods, such as (Wang and Vucetic, 2010; Vedaldi and Zisserman, 2012; Filice et al., 2014), but they are still specific to particular kernel formulations and learning algorithms. In (Croce and Basili, 2016) it has been shown that a viable and more general solution to the above scalability issues is the Nyström methodology, a dimensionality reduction technique that has been applied within kernel-based methods since (Williams and Seeger, 2001). This methodology approximates the Gram matrix derivable from a kernel function, enabling the projection of examples into low-dimensional spaces. The Nyström projection function is generated from a set of examples called landmarks, whose number directly impacts the quality of the embeddings; dually, the cost of projecting a new example into the embedding space rises linearly with the number of landmarks, which is usually orders of magnitude lower than the number of support vectors derived from a learning process. Once each example is projected into the dense low-dimensional space, efficient linear learning methods can be applied, such as (Hsieh et al., 2008), preserving at the same time the expressiveness and effectiveness of kernel methods. This approach is applicable to different input data as well as to different kernels or learning algorithms, as discussed in (Croce and Basili, 2016).

In this paper we show that the Nyström method can be effectively used in the cQA task: by adopting the same kernel functions proposed in (Filice et al., 2016), we obtain the same results w.r.t. the metrics adopted in the SemEval task, while reducing the computational cost by orders of magnitude.

In Section 2, we demonstrate the viability of the Nyström method for reducing the computational costs of kernel machines. Experimental results, obtained by applying efficient SVM learning to the cQA task, are discussed in Section 3. Finally, Section 4 describes related work, while Section 5 derives the conclusions.
2 Linearizing linguistic properties through the Nyström approach

Given an input training dataset o_i ∈ D, a kernel function K(o_i, o_j) is a similarity function that corresponds to a dot product in the implicit kernel space, i.e., K(o_i, o_j) = Φ(o_i) · Φ(o_j). The advantage of kernels is that the projection function Φ(o_i) = x_i ∈ R^n is never explicitly computed (Shawe-Taylor and Cristianini, 2004). In fact, this operation may be prohibitive when the dimensionality n of the underlying kernel space is extremely large. For example, Tree Kernels (Collins and Duffy, 2001) give rise to spaces whose number of dimensions is proportional to the number of possible sub-trees in a natural language. Kernel functions are exploited by kernel-based learning algorithms, such as SVM (Vapnik, 1998), to operate on the implicit kernel space without its explicit definition.

Let us assume that, given a kernel K, its explicit projection function Φ over D is available to derive new representations x_i, which are the rows of the resulting matrix X. We define the Gram matrix as G = XX^⊤, with each element corresponding to G_ij = Φ(o_i)Φ(o_j) = K(o_i, o_j). The aim of the Nyström method is to derive a new low-dimensional embedding in an l-dimensional space, with l ≪ n, so that G ≈ G̃ = X̃X̃^⊤. This is obtained by generating an approximation of G using a subset of l columns of the matrix. This corresponds to selecting a subset L of the available examples, called landmarks. Suppose we randomly sample l columns of G, and let C be the n × l matrix of these sampled columns. Then, we can rearrange the columns and rows of G and define X = [X_1 X_2] such that:

$$G = XX^\top = \begin{bmatrix} W & X_1^\top X_2 \\ X_2^\top X_1 & X_2^\top X_2 \end{bmatrix} \quad \text{and} \quad C = \begin{bmatrix} W \\ X_2^\top X_1 \end{bmatrix} \qquad (1)$$

where W = X_1^⊤ X_1, i.e., the subset of G that only considers landmarks. The Nyström approximation can be defined as:

$$G \approx \tilde{G} = C W^{\dagger} C^\top \qquad (2)$$

where W^† denotes the Moore-Penrose pseudo-inverse of W. The Singular Value Decomposition (SVD) is used to obtain W^† as follows. First, W is decomposed as W = USV^⊤, where U and V are both orthogonal matrices and S is a diagonal matrix containing the (non-zero) singular values of W on its diagonal. Since W is symmetric and positive definite, W = USU^⊤. Then W^† = US^{-1}U^⊤ = US^{-1/2}S^{-1/2}U^⊤, and Equation 2 can be rewritten as

$$G \approx \tilde{G} = C U S^{-\frac{1}{2}} S^{-\frac{1}{2}} U^\top C^\top = (C U S^{-\frac{1}{2}})(C U S^{-\frac{1}{2}})^\top = \tilde{X}\tilde{X}^\top \qquad (3)$$

Given an input example o_i ∈ D, a new low-dimensional representation x̃_i can thus be determined by considering the corresponding i-th row of C:

$$\tilde{x}_i = \Theta(o_i) = \vec{c}_i\, U S^{-\frac{1}{2}} \qquad (4)$$

where c_i is the vector whose dimensions contain the evaluations of the kernel function between o_i and each landmark o_j ∈ L. The method produces l-dimensional vectors, and no restriction is applied to the input dataset as long as a valid K(o_i, o_j) is used.

Several policies have been defined to select the landmarks so as to reduce the Gram matrix approximation error. In this work, uniform sampling without replacement is adopted, as suggested by (Kumar et al., 2012), where this policy has been theoretically and empirically shown to achieve results comparable with other (more complex) selection policies.

Assuming that k is the computational cost of a single kernel operation (expressed in terms of basic operations, such as products), the runtime cost of the Nyström method is O(knl + l^3 + nl^2), as it depends on (i) the computation of the n × l matrix C, i.e., O(knl); (ii) the SVD of W, which is O(l^3); and (iii) the projection of the entire dataset through the multiplication by C, i.e., O(nl^2). For several classes of kernels, such as Tree or Sequence Kernels (Collins and Duffy, 2001), the kernel computation cost is extremely high. Therefore, the computational cost of constructing the matrix C dominates the overall expression.

Once an example is projected into the l-dimensional space, efficient and large-scale learning algorithms can be applied. To further control the computational cost of the training step, we adopted a class of algorithms that bounds the number of times a single instance is re-used during training. In particular, we investigated the Dual Coordinate Descent algorithm (Hsieh et al., 2008): it is a batch learning algorithm whose achievable accuracy depends on the number of iterations T over the training dataset. Its training time cost on a dataset of n examples in R^l is O(Tnl). Since the number of iterations required to obtain an accurate model is fixed (in (Croce and Basili, 2016) a number of iterations T = 30 obtained stable and accurate results in several tasks), such cost is negligible w.r.t. the projection cost. Therefore, a complete training process exploiting the Nyström method costs simply O(knl), which should be compared with a traditional kernel-based SVM learning algorithm, e.g., (Chang and Lin, 2011), whose computational cost is about O(kn^2), with l ≪ n.

The computational cost of a classification step only depends on the projection of the example into the new space, i.e., O(kl). In fact, once a test example is projected, the final decision requires a dot product between the low-dimensional representation x̃_i and the hyperplane underlying the classification function: again, this is negligible with respect to the cost of the kernel operations. Such cost is typically much lower than the cost of a pure kernel-based classification, which requires a kernel operation against all the support vectors selected during the training process, whose number is usually far larger than the number of landmarks.
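To make Equations 1-4 concrete, the following sketch shows one possible NumPy implementation of the Nyström projection. It is an illustrative example, not the KeLP code used in the experiments; the Gaussian kernel and the synthetic vectors are stand-ins for the structural kernels and linguistic objects discussed above.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.1):
    # Stand-in for any valid kernel K(o_i, o_j), e.g., a Tree or
    # Sequence Kernel computed over parsed texts.
    return np.exp(-gamma * np.sum((a - b) ** 2))

def nystrom_projection(data, landmarks, kernel):
    """Builds the projection of Equation 4.

    Returns the embedded dataset X_tilde and the matrix P = U S^{-1/2},
    so that a new example o is embedded as k(o, landmarks) @ P.
    """
    # W: kernel evaluations among landmarks (the l x l block of Equation 1)
    W = np.array([[kernel(li, lj) for lj in landmarks] for li in landmarks])
    # SVD of the symmetric, positive semi-definite matrix W = U S U^T
    U, S, _ = np.linalg.svd(W)
    # Discard numerically zero singular values for a stable pseudo-inverse
    keep = S > 1e-12
    P = U[:, keep] / np.sqrt(S[keep])          # U S^{-1/2} (Equation 3)
    # C: kernel evaluations between every example and the landmarks
    C = np.array([[kernel(x, lj) for lj in landmarks] for x in data])
    return C @ P, P

# Toy usage: 200 synthetic examples, l = 20 landmarks sampled uniformly
# without replacement, as suggested by (Kumar et al., 2012).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))
landmarks = data[rng.choice(len(data), size=20, replace=False)]
X_tilde, P = nystrom_projection(data, landmarks, rbf_kernel)
print(X_tilde.shape)   # (200, 20): l-dimensional embeddings
```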
3 Empirical Investigation: the Community QA task

The proposed Nyström method has been applied to the SemEval-2016 community Question Answering (cQA) task, where participants are asked to automatically provide good answers in a community question answering setting (Nakov et al., 2016). In particular, we focused on Subtask A: given a question and a large collection of question-comment threads created by a user community, the task consists in (re-)ranking the comments that are most useful for answering the question. This task is interesting because kernel methods achieved the highest results, as demonstrated by the KeLP team (Filice et al., 2016). In particular, Subtask A is modeled as a binary classification problem, where examples are generated by considering (question, comment) pairs. Each pair generates an example for a binary SVM, where the positive label is associated with good comments and the negative label covers the potential and bad comments. The classification score is used to sort the instances and produce the final ranking. According to the above setting, a training and a test dataset made of 20,340 and 3,270 examples, respectively, are generated. In (Filice et al., 2016), a kernel-based SVM classifier achieved state-of-the-art results by adopting a kernel combination that exploited (i) feature vectors containing linguistic similarities between the texts in a pair; (ii) shallow syntactic trees that encode the lexical and morpho-syntactic information shared between text pairs; (iii) feature vectors capturing task-specific information.

First, a batch kernel-based SVM learning algorithm (Chang and Lin, 2011) operating on the kernel function proposed in (Filice et al., 2016) is adopted to determine the upper bound in terms of classification quality (but with higher computational costs). Then, multiple standard Nyström methods are used to linearize the dataset by sampling different numbers of landmarks: 10 configurations have been investigated, starting from 100 landmarks and incrementally adding 100 landmarks at a time. The higher the number of landmarks, the higher the quality of the approximated low-dimensional space (Drineas and Mahoney, 2005), but also the higher the computational cost. The most complex projection function is thus based on 1,000 landmarks. Landmarks have been selected by random sampling without replacement, as suggested in (Kumar et al., 2012). An efficient linear SVM (Hsieh et al., 2008) is adopted on the resulting embedding space. Experiments have been carried out using the KeLP framework (Filice et al., 2015a), available at https://github.com/SAG-KeLP.
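For illustration only, the sketch below assembles the full pipeline with off-the-shelf components: scikit-learn's Nystroem transformer for the embedding and LinearSVC (a liblinear-based linear SVM, in the spirit of the Dual Coordinate Descent learner of (Hsieh et al., 2008)) as the efficient linear classifier. The RBF kernel and synthetic data again stand in for the task-specific kernels and the (question, comment) pairs.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC  # liblinear: dual coordinate descent solver

# Synthetic binary data standing in for (question, comment) pair representations;
# in the real task the kernel is a combination of tree/sequence kernels
# rather than the RBF kernel used here as a stand-in.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(1000, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(200, 5))

# 1) Nystrom embedding with l = 100 landmarks sampled from the training data.
nys = Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0)
train_emb = nys.fit_transform(X_train)

# 2) Linear SVM on the l-dimensional embeddings: training is O(T * n * l).
clf = LinearSVC(C=1.0, max_iter=10000).fit(train_emb, y_train)

# 3) Classification: l kernel evaluations per test example plus one dot product;
#    in the cQA task the scores would be used to rank the comments of a question.
scores = clf.decision_function(nys.transform(X_test))
print(scores[:5])
```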
Table 1: Results in cQA. The upper bound is achieved by a standard SVM requiring more than 37M kernel operations.

  Landmarks      MAP    F1     Saving
  100            76.0   58.6   99.1%
  200            77.0   60.8   98.2%
  300            77.5   62.2   97.4%
  400            77.7   62.4   96.5%
  500            77.9   63.1   95.6%
  600            78.0   63.6   94.7%
  700            78.1   63.7   93.8%
  800            78.0   63.8   92.9%
  900            78.1   64.2   92.1%
  1000           78.2   64.4   91.2%
  standard SVM   79.2   64.4   -
  ConvKN         77.7   66.2   -
  SemanticZ      77.6   61.8   -

Results are reported in Table 1 in terms of Mean Average Precision (MAP, the official ranking metric of the competition), F1 on the good class, and computational saving, i.e., the percentage of avoided kernel operations at classification time. The standard SVM model contains 11,322 support vectors, thus requiring more than 37M kernel operations for the complete classification of the 3,270 test instances (the classification time was more than three and a half hours on a standard machine with a 4-core i7-2600 at 3.4 GHz). By adopting the Nyström methodology with only 1,000 landmarks, the same F1 score (i.e., 64.4) is obtained. Moreover, a MAP (i.e., 78.2%) comparable with the one achieved by the KeLP team is replicated with a saving of 91.2%. The speed-up is impressive also when fewer landmarks are used: with 300 landmarks, a MAP of 77.7 is obtained while saving more than 97% of the kernel computations. These results are remarkable, considering that results comparable with the state of the art can be obtained while reducing the computational costs by almost two orders of magnitude. Overall, the MAP obtained by the proposed approach is still higher than the one achieved by all the other systems in the challenge, including ConvKN (Barrón-Cedeño et al., 2016) and SemanticZ (Mihaylov and Nakov, 2016), i.e., the second and third best systems, respectively.
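As a sanity check, the saving column of Table 1 can be recovered directly from the counts above; for the 1,000-landmark configuration:

$$1 - \frac{1{,}000 \times 3{,}270}{11{,}322 \times 3{,}270} = 1 - \frac{1{,}000}{11{,}322} \approx 0.912$$

i.e., roughly 3.3M kernel evaluations against the landmarks instead of the 37M required by the 11,322 support vectors.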
4 Related Work

Improving the efficiency of kernel-based methods is a largely studied topic. The reduction of computational costs has long been pursued by imposing a budget on the number of support vectors (Cesa-Bianchi and Gentile, 2006; Dekel and Singer, 2006; Orabona et al., 2008; Wang and Vucetic, 2010; Filice et al., 2014). However, in complicated tasks such methods still require large budgets and thus still rely on many kernel computations. They are therefore less efficient than Nyström: a classifier based on the Nyström method with l landmarks has approximately the same computational complexity as its budgeted counterpart with a budget set to l, but its accuracy is typically higher, as shown in (Croce and Basili, 2016). Alternatively, Zanzotto and Dell'Arciprete (2012) proposed Distributed Tree Kernels (DTKs), which approximate tree kernels (Collins and Duffy, 2001) through an explicit mapping of trees into vectors. DTKs focus on specific tree kernel functions, while the approach proposed here can be effectively applied to any kernel function. An alternative strategy is presented in (Filice et al., 2015b), where a cascade of kernel-based classifiers is organized according to the computational cost of their kernel functions, so that more complex classifiers are invoked only on difficult instances. Their solution is strictly tied to the availability of multiple kernels that have to be sorted according to their complexity and expressiveness. Usually, it is hard to define many kernels for a given task, and consequently only few layers can be set.

5 Conclusion

This paper discussed the application of the Nyström method for a significant reduction of the computational costs of kernel-based classification in the cQA task. By projecting examples into low-dimensional embeddings, Nyström enables the adoption of efficient linear classifiers and drastically reduces the overall computational cost. Experimental results demonstrate that the proposed approach leads to a cost reduction higher than 90%, with a negligible performance drop. Future research will be devoted to the definition of a principled strategy to estimate the optimal number of layers, as well as the size of the embeddings at each layer.

References
Alberto Barrón-Cedeño, Giovanni Da San Martino, Shafiq Joty, Alessandro Moschitti, Fahad Al-Obaidli, Salvatore Romeo, Kateryna Tymoshenko, and Antonio Uva. 2016. ConvKN at SemEval-2016 Task 3: Answer and question selection for question answering on Arabic and English fora. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 896-903, San Diego, California, June. Association for Computational Linguistics.

Nicola Cancedda, Éric Gaussier, Cyril Goutte, and Jean-Michel Renders. 2003. Word-sequence kernels. Journal of Machine Learning Research, 3:1059-1082.

Nicolò Cesa-Bianchi and Claudio Gentile. 2006. Tracking the best hyperplane with a simple budget perceptron. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, pages 483-498. Springer-Verlag.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1-27:27, May.

Michael Collins and Nigel Duffy. 2001. Convolution kernels for natural language. In Proceedings of Neural Information Processing Systems (NIPS 2001), pages 625-632.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551-585, December.

Danilo Croce and Roberto Basili. 2016. Large-scale kernel-based language learning through the ensemble Nyström methods. In Advances in Information Retrieval - 38th European Conference on IR Research, ECIR 2016, Padua, Italy, March 20-23, 2016, Proceedings, pages 100-112.

Ofer Dekel and Yoram Singer. 2006. Support vector machines on a budget. In Bernhard Schölkopf, John Platt, and Thomas Hoffman, editors, NIPS, pages 345-352. MIT Press.

Petros Drineas and Michael W. Mahoney. 2005. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6.

Simone Filice, Giuseppe Castellucci, Danilo Croce, and Roberto Basili. 2014. Effective kernelized online learning in language processing tasks. In Proceedings of ECIR 2014, pages 347-358.

Simone Filice, Giuseppe Castellucci, Danilo Croce, and Roberto Basili. 2015a. KeLP: a kernel-based learning platform for natural language processing. In Proceedings of ACL: System Demonstrations, Beijing, China, July.

Simone Filice, Danilo Croce, and Roberto Basili. 2015b. A stratified strategy for efficient kernel-based learning. In AAAI Conference on Artificial Intelligence.

Simone Filice, Danilo Croce, Alessandro Moschitti, and Roberto Basili. 2016. KeLP at SemEval-2016 Task 3: Learning semantic relations between questions and answers. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1116-1123, San Diego, California, June. Association for Computational Linguistics.

Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. 2008. A dual coordinate descent method for large-scale linear SVM. In Proceedings of ICML 2008, pages 408-415, New York, NY, USA. ACM.

Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. 2012. Sampling methods for the Nyström method. Journal of Machine Learning Research, 13:981-1006, April.

Todor Mihaylov and Preslav Nakov. 2016. SemanticZ at SemEval-2016 Task 3: Ranking relevant answers in community question answering using semantic similarity based on fine-tuned word embeddings. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 879-886, San Diego, California, June. Association for Computational Linguistics.

Alessandro Moschitti, Silvia Quarteroni, Roberto Basili, and Suresh Manandhar. 2007. Exploiting syntactic and shallow semantic kernels for question/answer classification. In Proceedings of ACL'07.

Preslav Nakov, Lluís Màrquez, Alessandro Moschitti, Walid Magdy, Hamdy Mubarak, Abed Alhakim Freihat, Jim Glass, and Bilal Randeree. 2016. SemEval-2016 Task 3: Community question answering. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval '16, San Diego, California, June. Association for Computational Linguistics.

Francesco Orabona, Joseph Keshet, and Barbara Caputo. 2008. The Projectron: a bounded kernel-based perceptron. In Proceedings of ICML '08, pages 720-727, USA. ACM.

Aliaksei Severyn, Massimo Nicosia, and Alessandro Moschitti. 2013. Building structures from classifiers for passage reranking. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM '13, pages 969-978, New York, NY, USA. ACM.

John Shawe-Taylor and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA.

Vladimir N. Vapnik. 1998. Statistical Learning Theory. Wiley-Interscience.

Andrea Vedaldi and Andrew Zisserman. 2012. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3).

Zhuang Wang and Slobodan Vucetic. 2010. Online passive-aggressive algorithms on a budget. Journal of Machine Learning Research - Proceedings Track, 9:908-915.

Christopher K. I. Williams and Matthias Seeger. 2001. Using the Nyström method to speed up kernel machines. In Proceedings of NIPS 2000.

Fabio Massimo Zanzotto and Lorenzo Dell'Arciprete. 2012. Distributed tree kernels. In Proceedings of ICML 2012.