A Markovian Kernel-based Approach for itaLIan Speech acT labEliNg

Danilo Croce and Roberto Basili
University of Roma, Tor Vergata
Via del Politecnico 1, Rome, 00133, Italy
{croce,basili}@info.uniroma2.it

Abstract

English. This paper describes the UNITOR system that participated in the itaLIan Speech acT labEliNg task within the context of EvalIta 2018. A Structured Kernel-based Support Vector Machine is applied to make the classification of the dialogue turns sensitive to the syntactic and semantic information of each utterance, without relying on any task-specific manual feature engineering. Moreover, a specific Markovian formulation of the SVM is adopted, so that the labeling of each utterance depends on the speech acts assigned to the previous turns. The UNITOR system ranked first in the competition, suggesting that the combination of the adopted structured kernel and the Markovian modeling is beneficial.
Italiano. Questo lavoro descrive il sistema UNITOR che ha partecipato all'itaLIan Speech acT labEliNg task organizzato nell'ambito di EvalIta 2018. Il sistema è basato su una Structured Kernel-based Support Vector Machine (SVM) che rende la classificazione dei turni di dialogo dipendente dalle informazioni sintattiche e semantiche della frase, evitando la progettazione di alcuna feature specifica per il task. Una specifica formulazione Markoviana dell'algoritmo di apprendimento SVM permette inoltre di etichettare ciascun turno in funzione delle classificazioni dei turni precedenti. Il sistema UNITOR si è classificato al primo posto nella competizione, e questo conferma come la combinazione della funzione kernel e del modello Markoviano adottati sia molto utile allo sviluppo di sistemi di dialogo robusti.

1 Introduction

A dialogue agent is designed to interact and communicate with other agents in a coherent manner, not just through one-shot messages, but according to sequences of meaningful and related messages on an underlying topic or in support of an overall goal (Traum, 1999). These communications are seen not just as transmitting information, but as actions that change the state of the world, e.g., the mental states of the agents involved in the conversation, as well as the state, or context, of the dialogue. In other words, speech act theory allows designing an agent so that its communication is placed within the same general framework as the agent's actions. In such a context, the robust recognition of the speech acts characterizing an interaction is crucial for the design and deployment of artificial dialogue agents.

This specific task has been considered in the first itaLIan Speech acT labEliNg task at EvalIta (iLISTEN, (Novielli and Basile, 2018)): given a dialogue between an agent and a user, the task consists in automatically annotating the user's dialogue turns with speech act labels, i.e. with the communicative intention of the speaker, such as a statement, a request for information or an agreement. Table 1 reports the full set of speech act labels considered in the challenge, with definitions and examples.

Speech Act            Description                                          Example
OPENING               Dialogue opening or self-introduction                Ciao, io sono Antonella
CLOSING               Dialogue closing, e.g. farewell, wishes,             Va bene, ci vediamo prossimamente
                      intention to close the conversation
INFO-REQUEST          Utterances that are pragmatically, semantically,     E cosa mi dici delle vitamine?
                      and syntactically questions
SOLICIT-REQ-CLARIF    Request for clarification (please explain)           Mmm, si ma in che senso?
                      or solicitation of system reaction
STATEMENT             Descriptive, narrative, personal statements          Devo controllare maggiormente il mio peso.
GENERIC-ANSWER        Generic answer                                       Si, No, Non so
AGREE                 Expression of agreement, e.g. acceptance             Si, so che è importante.
                      of a proposal, plan or opinion
REJECT                Expression of disagreement, e.g. rejection           Ho sentito tesi contrastanti al proposito.
                      of a proposal, plan, or opinion
KIND-ATT-SMALLTALK    Expression of kind attitude through politeness,      Grazie, / Sei per caso offesa per qualcosa
                      e.g. thanking, apologizing or smalltalk              che ho detto?

Table 1: Full set of speech act labels considered at iLISTEN

In this paper, the UNITOR system participating in the iLISTEN task within the EvalIta 2018 evaluation campaign is described. The system realizes the classification task through a Structured Kernel-based Support Vector Machine (Vapnik, 1998) classifier. A structured kernel, namely a Smoothed Partial Tree Kernel (SPTK, (Croce et al., 2011)), is applied in order to make the classification of each utterance dependent on the syntactic and semantic information of the individual utterance.

Since turns are not observed in isolation, but immersed in a dialogue, we adopted a Markovian formulation of SVM, namely SVMhmm (Altun et al., 2003), so that the classification of the i-th utterance also depends on the dialogue act assigned to the (i-1)-th utterance.

The UNITOR system ranked first in the competition, suggesting that the combination of the adopted structured kernel and the Markovian learning algorithm is beneficial.

In the rest of the paper, Section 2 describes the adopted machine learning method and the underlying semantic kernel functions. In Section 3, the performance measures of the system are reported, while Section 4 derives the conclusions.
2 A Markovian Kernel-based Approach

The UNITOR system implements a Markovian formulation of the Support Vector Machine (SVM) learning algorithm. The SVM adopts a structured kernel function in order to estimate the syntactic and semantic similarity between utterances in a dialogue, without the need of any task-specific manual feature engineering (only the dependency parse of each sentence is required). In the rest of this section, first the learning algorithm is presented, then the adopted kernel method is discussed.

2.1 A Markovian Support Vector Machine

The aim of a Markovian formulation of SVM is to make the classification of an input example xi ∈ R^n (belonging to a sequence of examples) dependent on the labels assigned to the previous elements in a history of length m, i.e., xi-m, ..., xi-1. In our classification task, a dialogue is a sequence of utterances x = (x1, ..., xs), each of them representing an example xi, i.e., the specific i-th utterance. Given the corresponding sequence of expected labels y = (y1, ..., ys), a sequence of m step-specific labels (from a dictionary of d different dialogue acts) can be retrieved, in the form yi-m, ..., yi-1.

In order to make the classification of xi dependent also on the previous decisions, we augment the feature vector of xi by introducing a projection function ψm(xi) ∈ R^(md) that associates to each example an md-dimensional feature vector, where each dimension set to 1 corresponds to the presence of one of the d possible labels observed in a history of length m, i.e., in the m steps before the target element xi.

In order to apply an SVM, a projection function φm(·) can be defined to consider both the observations xi and the transitions ψm(xi) by concatenating the two representations as follows:

    φm(xi) = xi || ψm(xi)

with φm(xi) ∈ R^(n+md). Notice that the symbol || here denotes the vector concatenation, so that ψm(xi) does not interfere with the original feature space, where xi lies.
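To make the augmentation concrete, the projections ψm and φm can be sketched as follows. This is a minimal illustration with NumPy; the function names psi and phi are ours and do not refer to any released implementation.

```python
import numpy as np

def psi(history_labels, d, m):
    """psi_m(x_i): encode the m previously assigned labels as an
    (m*d)-dimensional indicator vector, one d-sized block per step."""
    v = np.zeros(m * d)
    # most recent label first; shorter histories (dialogue start) stay zero
    for step, y in enumerate(reversed(history_labels[-m:])):
        v[step * d + y] = 1.0
    return v

def phi(x, history_labels, d, m):
    """phi_m(x_i) = x_i || psi_m(x_i): plain concatenation, so the
    transition features do not interfere with the original space."""
    return np.concatenate([x, psi(history_labels, d, m)])
```

For instance, with d = 3 dialogue acts, m = 1 and a previous label with id 2, phi(np.array([0.5, 1.0]), [2], 3, 1) yields a 5-dimensional vector whose last component is 1.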
Kernel-based methods can be applied in order to model meaningful representation spaces, encoding both the features representing individual examples and the information about the transitions. According to kernel-based learning (Shawe-Taylor and Cristianini, 2004), we can define a kernel function Km(xi, zj) between a generic item xi of a sequence and another generic item zj from the same or a different sequence, parametric in the history length m: it surrogates the product between φm(·) such that:

    Km(xi, zj) = φm(xi) · φm(zj) = K_obs(xi, zj) + K_tr(ψm(xi), ψm(zj))

In other words, we define a kernel that is the linear combination of two further kernels: K_obs, operating over the individual examples xi, and K_tr, operating over the feature vectors encoding the involved transitions. It is worth noticing that K_obs does not depend on the position nor on the context of individual examples, in line with the Markov assumption characterizing a large class of generative models, e.g. HMMs. For simplicity, we define K_tr as a linear kernel between input instances, i.e. a dot-product in the space generated by ψm(·):

    Km(xi, zj) = K_obs(xi, zj) + ψm(xi) · ψm(zj)

Figure 1: Dependency parse tree of "Devo controllare maggiormente il mio peso" (in English: "I need to control my weight more"). [Tree rendering omitted: the tokens Devo / controllare / maggiormente / il / mio / peso carry the PoS tags AUX / VERB / ADV / DET / DET / NOUN and are connected through the dependency relations ROOT, AUX, ADVMOD, OBJ, DET and DET:POSS.]
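Under these definitions the combined kernel never needs φm explicitly: it suffices to sum the observation kernel and a linear kernel over the transition encodings. A minimal sketch, where any valid kernel (e.g. the structured kernel of Section 2.2) can be plugged in as k_obs; all names here are illustrative:

```python
import numpy as np

def k_m(x_i, psi_i, z_j, psi_j, k_obs):
    """K_m(x_i, z_j) = K_obs(x_i, z_j) + psi_m(x_i) . psi_m(z_j):
    the observation kernel plus a linear kernel over transitions."""
    return k_obs(x_i, z_j) + float(np.dot(psi_i, psi_j))
```

With a linear k_obs, k_m(np.array([1., 0.]), np.array([0., 1.]), np.array([1., 1.]), np.array([0., 1.]), lambda a, b: float(np.dot(a, b))) evaluates to 2.0: 1.0 from the observations plus 1.0 from the shared previous label.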
At training time, we use the kernel-based SVM in a One-Vs-All schema over the feature space derived by Km(·, ·). The learning process provides a family of classification functions f(xi; m) ⊂ R^(n+md) × R^d, which associate each xi with a distribution of scores over the d labels, depending on the context size m. At classification time, all possible sequences y ∈ Y+ should be considered in order to determine the best labeling ŷ, where m is the size of the history used to enrich xi, that is:

    ŷ = arg max_{y ∈ Y+} Σ_{i=1...m} f(xi; m)

In order to reduce the computational cost, a Viterbi-like decoding algorithm is adopted [1]. The next section defines the kernel function K_obs applied to the specific utterances.

[1] When applying f(xi; m) the classification scores are normalized through a softmax function, and probability scores are derived.

2.2 Structured Kernel Methods for Speech Act Labeling

Several NLP tasks require the exploration of complex semantic and syntactic phenomena. For instance, in Paraphrase Detection, verifying whether two sentences are valid paraphrases involves the analysis of rewriting rules in which the syntax plays a fundamental role. In Question Answering, the syntactic information is crucial, as largely demonstrated in (Croce et al., 2011).

A natural approach to exploit such linguistic information consists in applying kernel methods (Robert Müller et al., 2001; Shawe-Taylor and Cristianini, 2004) to structured representations of data objects, e.g., documents. A sentence s can be represented as a parse tree [2] that expresses the grammatical relations implied by s. Tree kernels (TKs) (Collins and Duffy, 2001) can be employed to directly operate on such parse trees, evaluating the tree fragments shared by the input trees. This operation corresponds to a dot product in the implicit feature space of all possible tree fragments. Whenever the dot product is available in the implicit feature space, kernel-based learning algorithms, such as SVMs, can operate in order to automatically generate robust prediction models.

[2] Parse trees can be extracted using automatic parsers. In our experiments, we used SpaCy https://spacy.io/.

TKs thus allow estimating the similarity among texts directly from their sentence syntactic structures, represented by parse trees. The underlying idea is that the similarity between two trees T1 and T2 can be derived from the number of shared tree fragments. Let the set T = {t1, t2, ..., t_|T|} be the space of all possible substructures, and let χi(n2) be an indicator function that is equal to 1 if the target ti is rooted at the node n2 and 0 otherwise. A tree kernel function over T1 and T2 is defined as:

    TK(T1, T2) = Σ_{n1 ∈ N_T1} Σ_{n2 ∈ N_T2} Δ(n1, n2)

where N_T1 and N_T2 are the sets of nodes of T1 and T2, respectively, and

    Δ(n1, n2) = Σ_{k=1}^{|T|} χk(n1) χk(n2)

which computes the number of common fragments between the trees rooted at nodes n1 and n2. The feature space generated by structural kernels obviously depends on the input structures. Notice that different tree representations embody different linguistic theories and may produce more or less effective syntactic/semantic feature spaces for a given task.
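The Viterbi-like decoding of Section 2.1 can be sketched as follows for a history of m = 1. Here score(x, y_prev, y) stands for the softmax-normalized classifier score of labeling turn x with y given the previous label y_prev; this signature is our illustrative assumption, not the SVMhmm interface:

```python
def viterbi_decode(turns, labels, score):
    """Return the highest-scoring label sequence for a dialogue,
    without enumerating all sequences in Y+ (history m = 1).

    score(x, y_prev, y): classifier score for labelling turn x with y
    when the previous turn was labelled y_prev (None at the start).
    """
    # best[y]: score of the best partial sequence ending with label y
    best = {y: score(turns[0], None, y) for y in labels}
    backpointers = []
    for x in turns[1:]:
        new_best, pointers = {}, {}
        for y in labels:
            # extend the best predecessor path for each candidate label
            yp = max(labels, key=lambda p: best[p] + score(x, p, y))
            new_best[y] = best[yp] + score(x, yp, y)
            pointers[y] = yp
        best = new_best
        backpointers.append(pointers)
    # follow the backpointers from the best final label
    y = max(best, key=best.get)
    sequence = [y]
    for pointers in reversed(backpointers):
        y = pointers[y]
        sequence.append(y)
    return list(reversed(sequence))
```

The cost is linear in the dialogue length and quadratic in the number of labels, instead of exponential as the explicit enumeration of Y+ would be.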
Dependency grammars produce a significantly different representation, which is exemplified in Figure 1. Since tree kernels are not tailored to model the labeled edges that are typical of dependency graphs, these latter are rewritten into explicit hierarchical representations. Different rewriting strategies are possible, as discussed in (Croce et al., 2011): a representation that has been shown to be effective in several tasks is the Grammatical Relation Centered Tree (GRCT) illustrated in Figure 2: the PoS-tags are children of grammatical function nodes and direct ancestors of their associated lexical items.

Figure 2: Grammatical Relation Centered Tree (GRCT) of "Devo controllare maggiormente il mio peso". [Tree rendering omitted: grammatical relation nodes (ROOT, AUX, ADVMOD, OBJ, DET, DET:POSS) dominate PoS nodes (AUX, VERB, ADV, DET, NOUN), which in turn dominate the lemmatized lexical items dovere::a, controllare::v, maggiormente::a, il::d, mio::d, peso::n.]

Different tree kernels can be defined according to the types of tree fragments considered in the evaluation of the matching structures. In the Subtree Kernel (Vishwanathan and Smola, 2002), valid fragments are only the grammatically well-formed and complete subtrees: every node in a subtree corresponds to a context-free rule whose left-hand side is the node label and whose right-hand side is completely described by the node descendants. Subset trees are exploited by the Subset Tree Kernel (Collins and Duffy, 2001), which is usually referred to as the Syntactic Tree Kernel (STK); they are more general structures, since their leaves can be non-terminal symbols. The subset trees satisfy the constraint that grammatical rules cannot be broken and every tree exhaustively represents a CFG rule. The Partial Tree Kernel (PTK) (Moschitti, 2006) relaxes this constraint by considering partial trees, i.e., fragments generated by the application of partial production rules (e.g. sequences of non-terminals with gaps). The strict constraint imposed by the STK may be problematic especially when the training dataset is small and only few syntactic tree configurations can be observed. The PTK overcomes this limitation, and usually leads to higher accuracy, as shown in (Moschitti, 2006).

Capitalizing lexical information in Convolution Kernels. The tree kernels introduced above perform a hard match between nodes when comparing two substructures. In NLP tasks, when nodes are words, this strict requirement results in a too strict lexical constraint that poorly reflects semantic phenomena, such as the synonymy of different words or the polysemy of a lexical entry. To overcome this limitation, we adopt Distributional models of Lexical Semantics (Sahlgren, 2006; Mikolov et al., 2013) to generalize the meaning of individual words by replacing them with geometrical representations (also called Word Embeddings) that are automatically derived from the analysis of large-scale corpora. These representations are based on the idea that words occurring in the same contexts tend to have similar meanings: the adopted distributional models generate vectors that are similar when the associated words exhibit a similar usage in large-scale document collections. Under this perspective, the distance between vectors reflects semantic relations between the represented words, such as paradigmatic relations, e.g., quasi-synonymy. These word spaces allow defining meaningful soft matching between lexical nodes, in terms of the distance between their representative vectors. As a result, it is possible to obtain more informative kernel functions, able to capture syntactic and semantic phenomena through grammatical and lexical constraints.

The Smoothed Partial Tree Kernel (SPTK) described in (Croce et al., 2011) exploits this idea, extending the PTK formulation with a similarity function σ between nodes:

    Δ_SPTK(n1, n2) = μ λ σ(n1, n2),  if n1 and n2 are leaves;

    Δ_SPTK(n1, n2) = μ σ(n1, n2) ( λ^2 + Σ_{I1,I2: l(I1)=l(I2)} λ^(d(I1)+d(I2)) Π_{k=1}^{l(I1)} Δ_SPTK(c_n1(i_1k), c_n2(i_2k)) )    (1)

In the SPTK formulation, the similarity function σ(n1, n2) between two nodes n1 and n2 is defined as follows:

- if n1 and n2 are both lexical nodes, then σ(n1, n2) = σ_LEX(n1, n2) = τ · (v_n1 · v_n2) / (||v_n1|| ||v_n2||). It is the cosine similarity between the word vectors v_n1 and v_n2 associated with the labels of n1 and n2, respectively. τ is called the terminal factor and weighs the contribution of the lexical similarity to the overall kernel computation.
- else, if n1 and n2 are nodes sharing the same label, then σ(n1, n2) = 1.
- else, σ(n1, n2) = 0.

In the challenge, we adopt the SPTK to implement the K_obs function used in the Markovian SVM. This kernel has in fact been shown to be very robust in the classification of (possibly short) sentences, such as in Question Classification (Croce et al., 2011) or Semantic Role Labeling (Croce et al., 2011; Croce et al., 2012).

3 Experimental Results

In iLISTEN, the reference dataset includes the transcriptions of 60 dialogues, amounting to about 22,000 words. The detailed statistics regarding dialogues and turns in the train and test datasets are reported in Table 2.

    Dataset    #dialogues   #user turns   #system turns   #total turns
    train      40           1,097         1,119           2,216
    test       20           479           492             971
    complete   60           1,576         1,611           3,187

Table 2: Statistics about the iLISTEN dataset

In the proposed classification workflow, each utterance from the training/test material is processed through the SpaCy [3] dependency parser, whose outputs are automatically converted into the GRCT structures [4] discussed in Section 2. These structures are used within the Markovian SVM implemented in KeLP [5] (Filice et al., 2015; Filice et al., 2018). The learning algorithm is based on a SPTK combined with a One-Vs-All multi-classification schema adopted to assign individual utterances to the targeted classes. All the parameters of the specific kernel (i.e. the contribution of the lexical nodes to the overall computation) and of the SVM algorithm have been tuned via 10-fold cross validation over the training set. In the Markovian SVM, a history of m = 1 previous steps allowed achieving the best results during the parameterization step. Given the limited size of the dataset, higher values of m led to sparse representations of the transitions ψm that are not helpful.

[3] It is freely available for several languages (including Italian) at https://spacy.io/
[4] Utterances may include more than one sentence and potentially generate different trees. These cases are handled as follows: all trees after the first one are linked through the creation of an artificial link between their roots and the root of the tree generated by the first sentence.
[5] http://www.kelp-ml.org/?page_id=215

As a multi-classification task, results are reported in terms of precision (P), recall (R) and F1-score with respect to the gold standard, as reported in Table 3. These are averaged across utterances (micro-statistics) and per class (macro-statistics).

                    Micro                    Macro
    Run        P      R      F1         P      R      F1
    UNITOR     .733   .733   .733       .681   .628   .653
    System2    .685   .685   .685       .608   .584   .596
    Baseline   .340   .340   .340       .037   .111   .056

Table 3: Results of the UNITOR system

Among the two submitted systems, UNITOR (reported on top of the table) achieved the best results, considering both micro- and macro-statistics, where F1 scores of 0.733 and 0.653 are achieved, respectively. These results are higher than those of the other participant (namely System2) and far higher than the baseline (which confirms the difficulty of the task).

Given the unbalanced number of examples per class, UNITOR achieves higher results w.r.t. the micro-statistics, while lower results are achieved on classes with a reduced number of examples. The confusion matrix reported in Table 4 shows that some recurrent misclassifications (e.g. the STATEMENT class with respect to the REJECT class) need to be carefully addressed. Clearly this is a very challenging task, also for the annotators, since the differences between the speech acts are not strictly defined: for example, given the stimulus of the system "Bisognerebbe mangiare solo se si ha fame, ed aspettare che la digestione sia completata prima di assumere altri cibi [6]", the answer "a volte il lavoro non mi permette di mangiare con ritmi regolari! [7]" should be labeled as REJECT, while the system provides STATEMENT. Overall, this result is noteworthy, also considering that the system did not require any task-specific feature modeling: the adopted structured kernel-based method allows capitalizing on the syntactic and semantic information useful for the task. The only requirement of the system is the availability of a dependency parser.

[6] Translated in English: "You should only eat if you are hungry, and wait until digestion is complete before eating again."
[7] Translated in English: "sometimes work doesn't allow me to eat at a regular pace!"
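For reference, the two averaging schemes reported in Table 3 can be computed as follows; this is a generic sketch of micro- vs macro-averaged F1, not the official task scorer:

```python
def micro_macro_f1(gold, pred):
    """Micro- and macro-averaged F1 over a multi-class labelling."""
    labels = sorted(set(gold) | set(pred))
    per_class_f1 = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == p == c)
        fp = sum(1 for g, p in zip(gold, pred) if p == c and g != c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class_f1.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    # with exactly one label per turn, micro P = R = F1 = accuracy,
    # which is why the three micro columns of Table 3 coincide
    micro = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
    macro = sum(per_class_f1) / len(per_class_f1)
    return micro, macro
```

The macro average weighs every class equally, which is why the rare classes (e.g. REJECT) depress it well below the micro score.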
                 STATEMENT  KIND-ATT.  GEN.-ANSW.  REJECT  CLOSING  SOL.-CLAR.  OPENING  AGREE  INFO-REQ.
    STATEMENT    153        6          3           24      0        3           0        2      13
    KIND-ATT.    4          17         0           5       1        2           0        3      2
    GEN.-ANSW.   1          0          48          0       0        1           0        6      0
    REJECT       0          3          0           3       0        0           0        0      1
    CLOSING      0          0          0           0       7        1           0        1      0
    SOL.-CLAR.   0          6          0           2       1        8           0        1      2
    OPENING      0          0          0           0       0        0           11       0      0
    AGREE        0          3          1           1       0        0           0        11     1
    INFO-REQ.    4          9          0           4       1        9           0        0      93

Table 4: Confusion matrix of the UNITOR system w.r.t. the gold standard. Columns report the classes from the gold standard, while rows report the system decisions; the diagonal entries are the correct classifications.

4 Conclusions

In this paper, the description of the UNITOR system participating in the iLISTEN task at EvalIta 2018 has been provided. The system ranked first in all the evaluations. Thus, the proposed classification strategy shows the beneficial impact of combining a structured kernel-based method with a Markovian classifier, capitalizing on the contribution of dialogue modeling in deciding the speech act of individual sentences. One important finding of this evaluation is that a quite robust speech act classifier can be obtained with almost no requirement in terms of task-specific feature and system engineering: results are appealing, mostly considering the reduced size of the dataset. Further work is needed to improve the overall F1 scores, possibly extending the adopted kernel function by addressing other dimensions of the linguistic information, or making the kernel more sensitive to task-specific knowledge. Also, the combination of the adopted strategy with recurrent neural approaches is an interesting research direction.

References

Y. Altun, I. Tsochantaridis, and T. Hofmann. 2003. Hidden Markov support vector machines. In Proceedings of the International Conference on Machine Learning.

Michael Collins and Nigel Duffy. 2001. Convolution kernels for natural language. In Proceedings of Neural Information Processing Systems (NIPS 2001), pages 625–632.

Danilo Croce, Alessandro Moschitti, and Roberto Basili. 2011. Structured lexical similarity via convolution kernels on dependency trees. In Proceedings of EMNLP.

Danilo Croce, Alessandro Moschitti, Roberto Basili, and Martha Palmer. 2012. Verb classification using distributional similarity in syntactic and semantic structures. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 263–272.

Simone Filice, Giuseppe Castellucci, Danilo Croce, and Roberto Basili. 2015. KeLP: a kernel-based learning platform for natural language processing. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 19–24.

Simone Filice, Giuseppe Castellucci, Giovanni Da San Martino, Alessandro Moschitti, Danilo Croce, and Roberto Basili. 2018. KeLP: a kernel-based learning platform. Journal of Machine Learning Research, 18(191):1–5.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Alessandro Moschitti. 2006. Efficient convolution kernels for dependency and constituent syntactic trees. In ECML, Berlin, Germany.

Nicole Novielli and Pierpaolo Basile. 2018. Overview of the EVALITA 2018 italian speech act labeling (iLISTEN) task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Klaus Robert Müller, Sebastian Mika, Gunnar Rätsch, Koji Tsuda, and Bernhard Schölkopf. 2001. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201.

Magnus Sahlgren. 2006. The Word-Space Model. Ph.D. thesis, Stockholm University.

John Shawe-Taylor and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.

David R. Traum. 1999. Speech Acts for Dialogue Agents, pages 169–201. Springer Netherlands, Dordrecht.

Vladimir N. Vapnik. 1998. Statistical Learning Theory. Wiley-Interscience.

S.V.N. Vishwanathan and Alexander J. Smola. 2002. Fast kernels on strings and trees. In Proceedings of Neural Information Processing Systems, pages 569–576.