A Markovian Kernel-based Approach for itaLIan Speech acT labEliNg

Danilo Croce and Roberto Basili
University of Roma, Tor Vergata
Via del Politecnico 1, Rome, 00133, Italy
{croce,basili}@info.uniroma2.it

Abstract

English. This paper describes the UNITOR system that participated in the itaLIan Speech acT labEliNg task within the context of EvalIta 2018. A Structured Kernel-based Support Vector Machine is applied to make the classification of the dialogue turns sensitive to the syntactic and semantic information of each utterance, without relying on any task-specific manual feature engineering. Moreover, a specific Markovian formulation of the SVM is adopted, so that the labeling of each utterance depends on the speech acts assigned to the previous turns. The UNITOR system ranked first in the competition, suggesting that the combination of the adopted structured kernel and the Markovian modeling is beneficial.
Italiano. Questo lavoro descrive il sistema UNITOR che ha partecipato all'itaLIan Speech acT labEliNg task organizzato nell'ambito di EvalIta 2018. Il sistema è basato su una Structured Kernel-based Support Vector Machine (SVM) che rende la classificazione dei turni di dialogo dipendente dalle informazioni sintattiche e semantiche della frase, evitando la progettazione di alcuna feature specifica per il task. Una specifica formulazione Markoviana dell'algoritmo di apprendimento SVM permette inoltre di etichettare ciascun turno in funzione delle classificazioni dei turni precedenti. Il sistema UNITOR si è classificato al primo posto nella competizione, e questo conferma come la combinazione della funzione kernel e del modello Markoviano adottati sia molto utile allo sviluppo di sistemi di dialogo robusti.

1 Introduction

A dialogue agent is designed to interact and communicate with other agents in a coherent manner, not just through one-shot messages, but according to sequences of meaningful and related messages on an underlying topic or in support of an overall goal (Traum, 1999). These communications are seen not just as transmitting information, but as actions that change the state of the world, e.g., the mental states of the agents involved in the conversation, as well as the state, or context, of the dialogue. In other words, speech act theory allows designing an agent so that its communication is placed within the same general framework as the agent's actions. In such a context, the robust recognition of the speech acts characterizing an interaction is crucial for the design and deployment of artificial dialogue agents.

This specific task has been considered in the first itaLIan Speech acT labEliNg task at EvalIta (iLISTEN, (Novielli and Basile, 2018)): given a dialogue between an agent and a user, the task consists in automatically annotating the user's dialogue turns with speech act labels, i.e. with the communicative intention of the speaker, such as a statement, a request for information or an agreement. Table 1 reports the full set of speech act labels considered in the challenge, with definitions and examples.

Speech Act            Description                                          Example
OPENING               Dialogue opening or self-introduction                Ciao, io sono Antonella
CLOSING               Dialogue closing, e.g. farewell, wishes,             Va bene, ci vediamo prossimamente
                      intention to close the conversation
INFO-REQUEST          Utterances that are pragmatically, semantically,     E cosa mi dici delle vitamine?
                      and syntactically questions
SOLICIT-REQ-CLARIF    Request for clarification (please explain)           Mmm, si ma in che senso?
                      or solicitation of system reaction
STATEMENT             Descriptive, narrative, personal statements          Devo controllare maggiormente il mio peso.
GENERIC-ANSWER        Generic answer                                       Si, No, Non so
AGREE                 Expression of agreement, e.g. acceptance             Si, so che è importante.
                      of a proposal, plan or opinion
REJECT                Expression of disagreement, e.g. rejection           Ho sentito tesi contrastanti al proposito.
                      of a proposal, plan, or opinion
KIND-ATT-SMALLTALK    Expression of kind attitude through politeness,      Grazie, / Sei per caso offesa per qualcosa
                      e.g. thanking, apologizing or smalltalk              che ho detto?

Table 1: Full set of speech act labels considered at iLISTEN

In this paper, the UNITOR system participating in the iLISTEN task within the EvalIta 2018 evaluation campaign is described. The system realizes the classification task through a Structured Kernel-based Support Vector Machine (Vapnik, 1998) classifier. A structured kernel, namely a Smoothed Partial Tree Kernel (SPTK, (Croce et al., 2011)), is applied in order to make the classification of each utterance dependent on the syntactic and semantic information of the individual utterance.

Since turns are not observed in isolation, but immersed in a dialogue, we adopted a Markovian formulation of SVM, namely SVMhmm (Altun et al., 2003), so that the classification of the i-th utterance also depends on the dialogue act assigned to the (i-1)-th utterance.

The UNITOR system ranked first in the competition, suggesting that the combination of the adopted structured kernel and the Markovian learning algorithm is beneficial.

In the rest of the paper, Section 2 describes the adopted machine learning method and the underlying semantic kernel functions. In Section 3, the performance measures of the system are reported, while Section 4 derives the conclusions.
2 A Markovian Kernel-based Approach

The UNITOR system implements a Markovian formulation of the Support Vector Machine (SVM) learning algorithm. The SVM adopts a structured kernel function in order to estimate the syntactic and semantic similarity between utterances in a dialogue, without the need of any task-specific manual feature engineering (only the dependency parse of each sentence is required). In the rest of this section, first the learning algorithm is presented, then the adopted kernel method is discussed.

2.1 A Markovian Support Vector Machine

The aim of a Markovian formulation of SVM is to make the classification of an input example xi ∈ R^n (belonging to a sequence of examples) dependent on the labels assigned to the previous elements in a history of length m, i.e., xi-m, ..., xi-1. In our classification task, a dialogue is a sequence of utterances x = (x1, ..., xs), each of them representing an example xi, i.e., the specific i-th utterance. Given the corresponding sequence of expected labels y = (y1, ..., ys), a sequence of m step-specific labels (from a dictionary of d different dialogue acts) can be retrieved, in the form yi-m, ..., yi-1.

In order to make the classification of xi dependent also on the previous decisions, we augment the feature vector of xi by introducing a projection function ψm(xi) ∈ R^(md) that associates to each example an md-dimensional feature vector, where each dimension set to 1 corresponds to the presence of one of the d possible labels observed in a history of length m, i.e., in the m steps before the target element xi.

In order to apply an SVM, a projection function φm(·) can be defined to consider both the observations xi and the transitions ψm(xi) by concatenating the two representations as follows:

    φm(xi) = xi || ψm(xi)

with φm(xi) ∈ R^(n+md). Notice that the symbol || here denotes the vector concatenation, so that ψm(xi) does not interfere with the original feature space, where xi lies.
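To make the augmentation concrete, the projections ψm and φm can be sketched as follows. This is a minimal illustration with NumPy; the function names psi and phi are ours and do not refer to any released implementation.

```python
import numpy as np

def psi(history_labels, d, m):
    """psi_m(x_i): encode the m previously assigned labels as an
    (m*d)-dimensional indicator vector, one d-sized block per step."""
    v = np.zeros(m * d)
    # most recent label first; shorter histories (dialogue start) stay zero
    for step, y in enumerate(reversed(history_labels[-m:])):
        v[step * d + y] = 1.0
    return v

def phi(x, history_labels, d, m):
    """phi_m(x_i) = x_i || psi_m(x_i): plain concatenation, so the
    transition features do not interfere with the original space."""
    return np.concatenate([x, psi(history_labels, d, m)])
```

For instance, with d = 3 dialogue acts, m = 1 and a previous label with id 2, phi(np.array([0.5, 1.0]), [2], 3, 1) yields a 5-dimensional vector whose last component is 1.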
Kernel-based methods can be applied in order to model meaningful representation spaces, encoding both the features representing individual examples and the information about the transitions. According to kernel-based learning (Shawe-Taylor and Cristianini, 2004), we can define a kernel function Km(xi, zj) between a generic item xi of a sequence and another generic item zj from the same or a different sequence, parametric in the history length m: it surrogates the product between φm(·) such that:

    Km(xi, zj) = φm(xi) · φm(zj) = K_obs(xi, zj) + K_tr(ψm(xi), ψm(zj))

In other words, we define a kernel that is the linear combination of two further kernels: K_obs, operating over the individual examples xi, and K_tr, operating over the feature vectors encoding the involved transitions. It is worth noticing that K_obs does not depend on the position nor on the context of individual examples, in line with the Markov assumption characterizing a large class of generative models, e.g. HMMs. For simplicity, we define K_tr as a linear kernel between input instances, i.e. a dot-product in the space generated by ψm(·):

    Km(xi, zj) = K_obs(xi, zj) + ψm(xi) · ψm(zj)

Figure 1: Dependency parse tree of "Devo controllare maggiormente il mio peso" (in English: "I need to control my weight more"). [Tree rendering omitted: the tokens Devo / controllare / maggiormente / il / mio / peso carry the PoS tags AUX / VERB / ADV / DET / DET / NOUN and are connected through the dependency relations ROOT, AUX, ADVMOD, OBJ, DET and DET:POSS.]
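Under these definitions the combined kernel never needs φm explicitly: it suffices to sum the observation kernel and a linear kernel over the transition encodings. A minimal sketch, where any valid kernel (e.g. the structured kernel of Section 2.2) can be plugged in as k_obs; all names here are illustrative:

```python
import numpy as np

def k_m(x_i, psi_i, z_j, psi_j, k_obs):
    """K_m(x_i, z_j) = K_obs(x_i, z_j) + psi_m(x_i) . psi_m(z_j):
    the observation kernel plus a linear kernel over transitions."""
    return k_obs(x_i, z_j) + float(np.dot(psi_i, psi_j))
```

With a linear k_obs, k_m(np.array([1., 0.]), np.array([0., 1.]), np.array([1., 1.]), np.array([0., 1.]), lambda a, b: float(np.dot(a, b))) evaluates to 2.0: 1.0 from the observations plus 1.0 from the shared previous label.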
At training time, we use the kernel-based SVM in a One-Vs-All schema over the feature space derived by Km(·, ·). The learning process provides a family of classification functions f(xi; m) ⊂ R^(n+md) × R^d, which associate each xi with a distribution of scores over the d labels, depending on the context size m. At classification time, all possible sequences y ∈ Y+ should be considered in order to determine the best labeling ŷ, where m is the size of the history used to enrich xi, that is:

    ŷ = arg max_{y ∈ Y+} Σ_{i=1...m} f(xi; m)

In order to reduce the computational cost, a Viterbi-like decoding algorithm is adopted [1]. The next section defines the kernel function K_obs applied to the specific utterances.

[1] When applying f(xi; m) the classification scores are normalized through a softmax function, and probability scores are derived.

2.2 Structured Kernel Methods for Speech Act Labeling

Several NLP tasks require the exploration of complex semantic and syntactic phenomena. For instance, in Paraphrase Detection, verifying whether two sentences are valid paraphrases involves the analysis of rewriting rules in which the syntax plays a fundamental role. In Question Answering, the syntactic information is crucial, as largely demonstrated in (Croce et al., 2011).

A natural approach to exploit such linguistic information consists in applying kernel methods (Robert Müller et al., 2001; Shawe-Taylor and Cristianini, 2004) to structured representations of data objects, e.g., documents. A sentence s can be represented as a parse tree [2] that expresses the grammatical relations implied by s. Tree kernels (TKs) (Collins and Duffy, 2001) can be employed to directly operate on such parse trees, evaluating the tree fragments shared by the input trees. This operation corresponds to a dot product in the implicit feature space of all possible tree fragments. Whenever the dot product is available in the implicit feature space, kernel-based learning algorithms, such as SVMs, can operate in order to automatically generate robust prediction models.

[2] Parse trees can be extracted using automatic parsers. In our experiments, we used SpaCy https://spacy.io/.

TKs thus allow estimating the similarity among texts directly from their sentence syntactic structures, represented by parse trees. The underlying idea is that the similarity between two trees T1 and T2 can be derived from the number of shared tree fragments. Let the set T = {t1, t2, ..., t_|T|} be the space of all possible substructures, and let χi(n2) be an indicator function that is equal to 1 if the target ti is rooted at the node n2 and 0 otherwise. A tree kernel function over T1 and T2 is defined as:

    TK(T1, T2) = Σ_{n1 ∈ N_T1} Σ_{n2 ∈ N_T2} Δ(n1, n2)

where N_T1 and N_T2 are the sets of nodes of T1 and T2, respectively, and

    Δ(n1, n2) = Σ_{k=1}^{|T|} χk(n1) χk(n2)

which computes the number of common fragments between the trees rooted at nodes n1 and n2. The feature space generated by structural kernels obviously depends on the input structures. Notice that different tree representations embody different linguistic theories and may produce more or less effective syntactic/semantic feature spaces for a given task.
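The Viterbi-like decoding of Section 2.1 can be sketched as follows for a history of m = 1. Here score(x, y_prev, y) stands for the softmax-normalized classifier score of labeling turn x with y given the previous label y_prev; this signature is our illustrative assumption, not the SVMhmm interface:

```python
def viterbi_decode(turns, labels, score):
    """Return the highest-scoring label sequence for a dialogue,
    without enumerating all sequences in Y+ (history m = 1).

    score(x, y_prev, y): classifier score for labelling turn x with y
    when the previous turn was labelled y_prev (None at the start).
    """
    # best[y]: score of the best partial sequence ending with label y
    best = {y: score(turns[0], None, y) for y in labels}
    backpointers = []
    for x in turns[1:]:
        new_best, pointers = {}, {}
        for y in labels:
            # extend the best predecessor path for each candidate label
            yp = max(labels, key=lambda p: best[p] + score(x, p, y))
            new_best[y] = best[yp] + score(x, yp, y)
            pointers[y] = yp
        best = new_best
        backpointers.append(pointers)
    # follow the backpointers from the best final label
    y = max(best, key=best.get)
    sequence = [y]
    for pointers in reversed(backpointers):
        y = pointers[y]
        sequence.append(y)
    return list(reversed(sequence))
```

The cost is linear in the dialogue length and quadratic in the number of labels, instead of exponential as the explicit enumeration of Y+ would be.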
Dependency grammars produce a significantly different representation, which is exemplified in Figure 1. Since tree kernels are not tailored to model the labeled edges that are typical of dependency graphs, these latter are rewritten into explicit hierarchical representations. Different rewriting strategies are possible, as discussed in (Croce et al., 2011): a representation that has been shown to be effective in several tasks is the Grammatical Relation Centered Tree (GRCT) illustrated in Figure 2: the PoS-tags are children of grammatical function nodes and direct ancestors of their associated lexical items.

Figure 2: Grammatical Relation Centered Tree (GRCT) of "Devo controllare maggiormente il mio peso". [Tree rendering omitted: grammatical relation nodes (ROOT, AUX, ADVMOD, OBJ, DET, DET:POSS) dominate PoS nodes (AUX, VERB, ADV, DET, NOUN), which in turn dominate the lemmatized lexical items dovere::a, controllare::v, maggiormente::a, il::d, mio::d, peso::n.]

Different tree kernels can be defined according to the types of tree fragments considered in the evaluation of the matching structures. In the Subtree Kernel (Vishwanathan and Smola, 2002), valid fragments are only the grammatically well-formed and complete subtrees: every node in a subtree corresponds to a context-free rule whose left-hand side is the node label and whose right-hand side is completely described by the node descendants. Subset trees are exploited by the Subset Tree Kernel (Collins and Duffy, 2001), which is usually referred to as the Syntactic Tree Kernel (STK); they are more general structures, since their leaves can be non-terminal symbols. The subset trees satisfy the constraint that grammatical rules cannot be broken and every tree exhaustively represents a CFG rule. The Partial Tree Kernel (PTK) (Moschitti, 2006) relaxes this constraint by considering partial trees, i.e., fragments generated by the application of partial production rules (e.g. sequences of non-terminals with gaps). The strict constraint imposed by the STK may be problematic especially when the training dataset is small and only few syntactic tree configurations can be observed. The PTK overcomes this limitation, and usually leads to higher accuracy, as shown in (Moschitti, 2006).

Capitalizing lexical information in Convolution Kernels. The tree kernels introduced above perform a hard match between nodes when comparing two substructures. In NLP tasks, when nodes are words, this strict requirement results in a too strict lexical constraint that poorly reflects semantic phenomena, such as the synonymy of different words or the polysemy of a lexical entry. To overcome this limitation, we adopt Distributional models of Lexical Semantics (Sahlgren, 2006; Mikolov et al., 2013) to generalize the meaning of individual words by replacing them with geometrical representations (also called Word Embeddings) that are automatically derived from the analysis of large-scale corpora. These representations are based on the idea that words occurring in the same contexts tend to have similar meanings: the adopted distributional models generate vectors that are similar when the associated words exhibit a similar usage in large-scale document collections. Under this perspective, the distance between vectors reflects semantic relations between the represented words, such as paradigmatic relations, e.g., quasi-synonymy. These word spaces allow defining meaningful soft matching between lexical nodes, in terms of the distance between their representative vectors. As a result, it is possible to obtain more informative kernel functions, able to capture syntactic and semantic phenomena through grammatical and lexical constraints.

The Smoothed Partial Tree Kernel (SPTK) described in (Croce et al., 2011) exploits this idea, extending the PTK formulation with a similarity function σ between nodes:

    Δ_SPTK(n1, n2) = μ λ σ(n1, n2),  if n1 and n2 are leaves;

    Δ_SPTK(n1, n2) = μ σ(n1, n2) ( λ^2 + Σ_{I1,I2: l(I1)=l(I2)} λ^(d(I1)+d(I2)) Π_{k=1}^{l(I1)} Δ_SPTK(c_n1(i_1k), c_n2(i_2k)) )    (1)

In the SPTK formulation, the similarity function σ(n1, n2) between two nodes n1 and n2 is defined as follows:

- if n1 and n2 are both lexical nodes, then σ(n1, n2) = σ_LEX(n1, n2) = τ · (v_n1 · v_n2) / (||v_n1|| ||v_n2||). It is the cosine similarity between the word vectors v_n1 and v_n2 associated with the labels of n1 and n2, respectively. τ is called the terminal factor and weighs the contribution of the lexical similarity to the overall kernel computation.
- else, if n1 and n2 are nodes sharing the same label, then σ(n1, n2) = 1.
- else, σ(n1, n2) = 0.

In the challenge, we adopt the SPTK to implement the K_obs function used in the Markovian SVM. This kernel has in fact been shown to be very robust in the classification of (possibly short) sentences, such as in Question Classification (Croce et al., 2011) or Semantic Role Labeling (Croce et al., 2011; Croce et al., 2012).

3 Experimental Results

In iLISTEN, the reference dataset includes the transcriptions of 60 dialogues, amounting to about 22,000 words. The detailed statistics regarding dialogues and turns in the train and test datasets are reported in Table 2.

    Dataset    #dialogues   #user turns   #system turns   #total turns
    train      40           1,097         1,119           2,216
    test       20           479           492             971
    complete   60           1,576         1,611           3,187

Table 2: Statistics about the iLISTEN dataset

In the proposed classification workflow, each utterance from the training/test material is processed through the SpaCy [3] dependency parser, whose outputs are automatically converted into the GRCT structures [4] discussed in Section 2. These structures are used within the Markovian SVM implemented in KeLP [5] (Filice et al., 2015; Filice et al., 2018). The learning algorithm is based on a SPTK combined with a One-Vs-All multi-classification schema adopted to assign individual utterances to the targeted classes. All the parameters of the specific kernel (i.e. the contribution of the lexical nodes to the overall computation) and of the SVM algorithm have been tuned via 10-fold cross validation over the training set. In the Markovian SVM, a history of m = 1 previous steps allowed achieving the best results during the parameterization step. Given the limited size of the dataset, higher values of m led to sparse representations of the transitions ψm that are not helpful.

[3] It is freely available for several languages (including Italian) at https://spacy.io/
[4] Utterances may include more than one sentence and potentially generate different trees. These cases are handled as follows: all trees after the first one are linked through the creation of an artificial link between their roots and the root of the tree generated by the first sentence.
[5] http://www.kelp-ml.org/?page_id=215

As a multi-classification task, results are reported in terms of precision (P), recall (R) and F1-score with respect to the gold standard, as reported in Table 3. These are averaged across utterances (micro-statistics) and per class (macro-statistics).

                    Micro                    Macro
    Run        P      R      F1         P      R      F1
    UNITOR     .733   .733   .733       .681   .628   .653
    System2    .685   .685   .685       .608   .584   .596
    Baseline   .340   .340   .340       .037   .111   .056

Table 3: Results of the UNITOR system

Among the two submitted systems, UNITOR (reported on top of the table) achieved the best results, considering both micro- and macro-statistics, where F1 scores of 0.733 and 0.653 are achieved, respectively. These results are higher than those of the other participant (namely System2) and far higher than the baseline (which confirms the difficulty of the task).

Given the unbalanced number of examples per class, UNITOR achieves higher results w.r.t. the micro-statistics, while lower results are achieved on classes with a reduced number of examples. The confusion matrix reported in Table 4 shows that some recurrent misclassifications (e.g. the STATEMENT class with respect to the REJECT class) need to be carefully addressed. Clearly this is a very challenging task, also for the annotators, since the differences between the speech acts are not strictly defined: for example, given the stimulus of the system "Bisognerebbe mangiare solo se si ha fame, ed aspettare che la digestione sia completata prima di assumere altri cibi [6]", the answer "a volte il lavoro non mi permette di mangiare con ritmi regolari! [7]" should be labeled as REJECT, while the system provides STATEMENT. Overall, this result is noteworthy, also considering that the system did not require any task-specific feature modeling: the adopted structured kernel-based method allows capitalizing on the syntactic and semantic information useful for the task. The only requirement of the system is the availability of a dependency parser.

[6] Translated in English: "You should only eat if you are hungry, and wait until digestion is complete before eating again."
[7] Translated in English: "sometimes work doesn't allow me to eat at a regular pace!"
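For reference, the two averaging schemes reported in Table 3 can be computed as follows; this is a generic sketch of micro- vs macro-averaged F1, not the official task scorer:

```python
def micro_macro_f1(gold, pred):
    """Micro- and macro-averaged F1 over a multi-class labelling."""
    labels = sorted(set(gold) | set(pred))
    per_class_f1 = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == p == c)
        fp = sum(1 for g, p in zip(gold, pred) if p == c and g != c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class_f1.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    # with exactly one label per turn, micro P = R = F1 = accuracy,
    # which is why the three micro columns of Table 3 coincide
    micro = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
    macro = sum(per_class_f1) / len(per_class_f1)
    return micro, macro
```

The macro average weighs every class equally, which is why the rare classes (e.g. REJECT) depress it well below the micro score.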
                 STATEMENT  KIND-ATT.  GEN.-ANSW.  REJECT  CLOSING  SOL.-CLAR.  OPENING  AGREE  INFO-REQ.
    STATEMENT    153        6          3           24      0        3           0        2      13
    KIND-ATT.    4          17         0           5       1        2           0        3      2
    GEN.-ANSW.   1          0          48          0       0        1           0        6      0
    REJECT       0          3          0           3       0        0           0        0      1
    CLOSING      0          0          0           0       7        1           0        1      0
    SOL.-CLAR.   0          6          0           2       1        8           0        1      2
    OPENING      0          0          0           0       0        0           11       0      0
    AGREE        0          3          1           1       0        0           0        11     1
    INFO-REQ.    4          9          0           4       1        9           0        0      93

Table 4: Confusion matrix of the UNITOR system w.r.t. the gold standard. Columns report the classes from the gold standard, while rows report the system decisions; the diagonal entries are the correct classifications.

4 Conclusions

In this paper, the description of the UNITOR system participating in the iLISTEN task at EvalIta 2018 has been provided. The system ranked first in all the evaluations. Thus, the proposed classification strategy shows the beneficial impact of combining a structured kernel-based method with a Markovian classifier, capitalizing on the contribution of dialogue modeling in deciding the speech act of individual sentences. One important finding of this evaluation is that a quite robust speech act classifier can be obtained with almost no requirement in terms of task-specific feature and system engineering: results are appealing, mostly considering the reduced size of the dataset. Further work is needed to improve the overall F1 scores, possibly extending the adopted kernel function by addressing other dimensions of the linguistic information, or making the kernel more sensitive to task-specific knowledge. Also, the combination of the adopted strategy with recurrent neural approaches is an interesting research direction.

References

Y. Altun, I. Tsochantaridis, and T. Hofmann. 2003. Hidden Markov support vector machines. In Proceedings of the International Conference on Machine Learning.

Michael Collins and Nigel Duffy. 2001. Convolution kernels for natural language. In Proceedings of Neural Information Processing Systems (NIPS 2001), pages 625–632.

Danilo Croce, Alessandro Moschitti, and Roberto Basili. 2011. Structured lexical similarity via convolution kernels on dependency trees. In Proceedings of EMNLP.

Danilo Croce, Alessandro Moschitti, Roberto Basili, and Martha Palmer. 2012. Verb classification using distributional similarity in syntactic and semantic structures. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 263–272.

Simone Filice, Giuseppe Castellucci, Danilo Croce, and Roberto Basili. 2015. KeLP: a kernel-based learning platform for natural language processing. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 19–24.

Simone Filice, Giuseppe Castellucci, Giovanni Da San Martino, Alessandro Moschitti, Danilo Croce, and Roberto Basili. 2018. KeLP: a kernel-based learning platform. Journal of Machine Learning Research, 18(191):1–5.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Alessandro Moschitti. 2006. Efficient convolution kernels for dependency and constituent syntactic trees. In ECML, Berlin, Germany.

Nicole Novielli and Pierpaolo Basile. 2018. Overview of the EVALITA 2018 italian speech act labeling (iLISTEN) task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Klaus Robert Müller, Sebastian Mika, Gunnar Rätsch, Koji Tsuda, and Bernhard Schölkopf. 2001. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201.

Magnus Sahlgren. 2006. The Word-Space Model. Ph.D. thesis, Stockholm University.

John Shawe-Taylor and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.

David R. Traum. 1999. Speech Acts for Dialogue Agents, pages 169–201. Springer Netherlands, Dordrecht.

Vladimir N. Vapnik. 1998. Statistical Learning Theory. Wiley-Interscience.

S.V.N. Vishwanathan and Alexander J. Smola. 2002. Fast kernels on strings and trees. In Proceedings of Neural Information Processing Systems, pages 569–576.