<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the Evalita 2018 itaLIan Speech acT labEliNg (iLISTEN) Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <email>pierpaolo.basile@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicole Novielli</string-name>
          <email>nicole.novielli@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Bari Aldo Moro Dipartimento di Informatica Via E. Orabona</institution>
          ,
          <addr-line>4 - 70125 Bari</addr-line>
          ,
          <country country="IT">ITALY</country>
        </aff>
      </contrib-group>
      <abstract>
<p>English. We describe the first edition of the “itaLIan Speech acT labEliNg” (iLISTEN) task at the EVALITA 2018 campaign (Caselli et al., 2018). The task consists in automatically annotating dialogue turns with speech act labels, i.e. with the communicative intention of the speaker, such as statement, request for information, agreement, opinion expression, or general answer. The task is motivated by the large number of applications that could benefit from automatic speech act annotation of natural language interactions, such as tools for intelligent information access based on natural dialogue. We received two runs from two teams, one from academia and the other from industry. In spite of the inherent complexity of the task, both systems largely outperformed the baseline.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>Italiano. We describe the first edition of the “itaLIan Speech acT labEliNg” (iLISTEN) task, organized within the EVALITA 2018 evaluation campaign. The task consists in the automatic annotation of dialogue turns with the corresponding speech act label. Each speech act category denotes the communicative intention of the speaker, i.e. the intention to make an objective statement, express an opinion, request information, give an answer, or express agreement. We believe the task is relevant to the field of computational linguistics and beyond, in light of the recent interest of the scientific community in dialogue-based paradigms for interaction and intelligent access to information. Two teams took part in the task, one academic and one industrial. Despite the complexity of the proposed task, both teams largely outperformed the baseline.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>
        Speech acts have been extensively investigated in
linguistics
        <xref ref-type="bibr" rid="ref1 ref12">(Austin, 1962; Searle, 1969)</xref>
        , and
computational linguistics
        <xref ref-type="bibr" rid="ref14 ref15">(Traum, 2000; Stolcke et al.,
2000)</xref>
        for a long time. Specifically, the task of
automatic speech act recognition has been addressed
leveraging both supervised
        <xref ref-type="bibr" rid="ref10 ref14 ref16 ref8">(Stolcke et al., 2000;
Vosoughi and Roy, 2016)</xref>
        and unsupervised
approaches
        <xref ref-type="bibr" rid="ref11 ref6">(Novielli and Strapparava, 2011)</xref>
        . This
interest is justified by the large number of
applications that could benefit from automatic speech act
annotation of natural language interactions.
      </p>
      <p>
        In particular, a recent research trend has
emerged to investigate methodologies enabling
intelligent access to information by
relying on natural dialogue as an interaction metaphor.
In this perspective, chat-oriented dialogue systems
are attracting the increasing attention of both
researchers and practitioners interested in the
simulation of natural dialogues with embodied
conversational agents (Klüwer, 2011), conversational
interfaces for smart devices
        <xref ref-type="bibr" rid="ref10">(McTear et al., 2016)</xref>
        and
the Internet of Things
        <xref ref-type="bibr" rid="ref10 ref16 ref8">(Kar and Haldar, 2016)</xref>
        . As
a consequence, we are witnessing a flourishing
of dedicated research venues on chat-oriented
interaction. This is the case of WOCHAT, the Special
Session on Chatbots and Conversational Agents,
now at its second edition, as well as the special
session on Natural Language Generation for Dialogue
Systems, both co-located with the Annual
SIGdial Meeting on Discourse and Dialogue.
      </p>
      <p>
        While not representing any deep understanding
of the interaction dynamics, speech acts can be
successfully employed as a coding standard for
natural dialogue tasks. In this report, we describe
the first edition of the “itaLIan Speech acT
labEliNg” (iLISTEN) task at the EVALITA 2018
campaign
        <xref ref-type="bibr" rid="ref2">(Caselli et al., 2018)</xref>
        . Among the various
challenges posed by the problem of enabling
conversational access to information, this shared task
tackles the problem of recognition of the
illocutionary force, i.e. the speech act, of a dialogue
turn, that is the communicative goal of the speaker.
      </p>
      <p>The remainder of the paper is organized as
follows. We start by explaining the task in
Section 2. In Section 3, we provide a detailed
description of the dataset of dialogues, the
annotation schema, and the data format and distribution
protocol. Then, we report about the evaluation
methodology (see Section 4) and describe the
participating systems and their performance (see
Section 5). We provide final remarks in Section 6.
</p>
    </sec>
    <sec id="sec-3">
      <title>2 Task Description</title>
      <p>The task consists in automatically annotating
dialogue turns with speech act labels, i.e. with
the communicative intention of the speaker, such
as statement, request for information, agreement,
opinion expression, general answer, etc. Table 1
reports the full set of speech act labels used for the
classification task, with definition, examples, and
distribution in our corpus. Regarding the
evaluation procedure, we assess the ability of each
system to issue the correct speech act label among
those included in the taxonomy used for
annotation, described in Section 3. Please note that
the participating systems are requested to issue
labels only for the speech acts used for labeling the
user’s dialogue turns, as further detailed in the
following.</p>
    </sec>
    <sec id="sec-4">
      <title>3 Development and Test Data</title>
      <sec id="sec-4-1">
        <title>3.1 A Dataset of Dialogues</title>
        <p>
          We leverage the corpus of natural language
dialogues collected in the scope of previous research
about interaction with Embodied Conversational
Agents (ECAs)
          <xref ref-type="bibr" rid="ref3">(Clarizio et al., 2006)</xref>
          , in order to
speed up the process of building a gold standard.
The corpus contains transcripts of 60
dialogues overall, comprising 1,576 user dialogue
turns, 1,611 system turns, and about 22,000 words.
        </p>
        <p>The dialogues were collected using a Wizard
of Oz tool as dialogue manager. Sixty subjects
(aged between 21–28) were involved in the study,
in two interaction mode conditions: thirty of them
interacted with the system in a written-input
setting, using keyboard and mouse; the remaining
thirty dialogues were collected with users
interacting with the ECA in a spoken-input condition. The
dialogues collected using the spoken interaction
mode were manually transcribed based on
audio recordings of the dialogue sessions.</p>
        <p>During the interaction, the ECA played the role
of an artificial therapist and the users were free to
interact with it in natural language, without any
particular constraint: they could simply answer the
questions of the agent or take the initiative and
ask questions in their turn, make comments about
the agent’s behavior or competence, or argue in
favor of or against the agent’s suggestions or
persuasion attempts. The Wizard, in turn, had to
choose among a set of about 80 predefined
possible system moves. As such, the system moves
(see Table 2) are provided only as context
information; they are not subject to evaluation and do
not contribute to the final ranking of the
participating systems. Conversely, the participating
systems are evaluated on the basis of the performance
observed for the user dialogue turns (see Table 1).</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2 Annotation Schema</title>
        <p>
          A speech act can be identified with the
communicative goal of a given utterance, i.e. it
represents the utterance’s meaning at the level of its
illocutionary force
          <xref ref-type="bibr" rid="ref1">(Austin, 1962)</xref>
          . In defining dialogue
act taxonomies, researchers have been trying to
solve the trade-off between the need for formal
semantics and the need for computational
feasibility, also taking into account the specificity of
the many application domains that have been
investigated (see
          <xref ref-type="bibr" rid="ref15">(Traum, 2000)</xref>
          for an exhaustive
overview). The Dialogue Act Markup in Several
Layers (DAMSL) represents an attempt by
          <xref ref-type="bibr" rid="ref4">(Core
and Allen, 1997)</xref>
          to define a domain-independent
framework for speech act annotation.
        </p>
        <p>Defining a speech act markup language is out
of the scope of the present study. Therefore, we
adopt the original annotation of the Italian
advice-giving dialogues. Table 1 shows the set of nine
labels employed for the purpose of this study, with
definitions and examples. These labels are used
for the annotation of the users’ dialogue turns and
are the object of classification for this task. In
addition, in Table 2 we report the speech act labels
used for the dialogue moves of the system, i.e. the
conversational agent playing the role of the
artificial therapist. The speech act taxonomy refines the
DAMSL categories to allow appropriate tagging
of the communicative intention with respect to the
application domain, i.e. persuasion dialogues in
the healthy eating domain.</p>
        <p>In Table 3 we provide an excerpt from a
dialogue from our gold standard. The system moves
(dialogue moves and corresponding speech act
labels) are chosen from a set of predefined dialogue
moves that can be played by the ECA. As such,
they are not interesting for the evaluation and
ranking of participating systems and are provided only
as contextual information. Conversely, the final
ranking of the participating systems is based on
the performance observed only on the prediction
of speech acts for the users’ move, with respect
to the set of labels provided in Table 1. Please
note that the two sets of speech act labels for the
user and the system moves, in Table 1 and Table
2, respectively, only partially overlap. This is
because the set of agent moves also includes
speech acts (such as persuasion attempts) that
are observed only for the agent, given its caregiver
role in the dialogue system. Vice versa, some
speech act labels (such as clarification questions)
are relevant only for the user moves.</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3 Data Format and Distribution</title>
        <p>We provide both the training and test dialogues
in XML format, following the structure
proposed in Figure 1. Each participating team initially had
access to the training data only. Later, the
unlabeled test data were released during the evaluation
period. The development and test data sets
contain 40 and 20 dialogues, respectively, equally
distributed with respect to the interaction mode
(text- vs. speech-based interaction).</p>
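Figure 1 itself is not reproduced in this report. Assuming a hypothetical file layout along the lines sketched below (the element and attribute names are illustrative assumptions, not the actual schema), the dialogue turns can be loaded with the Python standard library:

```python
import xml.etree.ElementTree as ET

# Hypothetical dialogue file: element and attribute names here are
# assumptions for illustration; the real schema is the one proposed
# in Figure 1 of the task description.
sample = """
<dialogue id="d01">
  <turn speaker="system" speechAct="OPENING">Ciao, come posso aiutarti?</turn>
  <turn speaker="user" speechAct="OPENING">Ciao!</turn>
  <turn speaker="user" speechAct="INFO-REQUEST">Cosa puoi fare per me?</turn>
</dialogue>
"""

root = ET.fromstring(sample)
# Only user turns are scored; system turns serve as context.
user_turns = [(t.text, t.get("speechAct"))
              for t in root.iter("turn") if t.get("speaker") == "user"]
```

A participant system would train on the labeled user turns of the 40 development dialogues and predict labels for the 20 test dialogues released unlabeled.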
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 Evaluation</title>
      <p>Regarding the evaluation procedure, we assess the
ability of each system to issue the correct speech
act label for the user moves. The speech act labels
used for the annotation of the user moves are reported
in Table 1.</p>
      <p>Specifically, we compute precision, recall, and
F1-score (macro-averaged) with respect to our
gold standard. This approach, while more verbose
than a simple accuracy test, arises from the need to
correctly address the unbalanced distribution of
labels in the dataset. Furthermore, by providing
detailed performance metrics, we intend to foster
interesting discussion on the nature of the problem
and the data, as it might emerge from the
participants’ final reports. As a baseline, we use the
most frequent label for the user speech acts (i.e.,
STATEMENT).</p>
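As a rough sketch of the scoring described above (the function and toy data are illustrative, not the official evaluation script), macro-averaging computes the F1 per class and then averages, so that rare speech act labels weigh as much as STATEMENT:

```python
def macro_f1(gold, pred):
    """Macro-averaged F1: compute F1 for each class, then average,
    so rare labels count as much as the majority class."""
    f1s = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# The majority-class baseline gets zero F1 on every class except
# STATEMENT, which drags its macro-F down:
gold = ["STATEMENT", "STATEMENT", "INFO-REQUEST", "REJECT"]
baseline = ["STATEMENT"] * len(gold)
```

This is why a trivial majority-class predictor can reach a decent micro-F (it gets the many STATEMENT turns right) while its macro-F stays low.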
      <sec id="sec-5-1">
        <title>5 Participating Systems and Results</title>
        <p>(Class labels for the user speech acts: OPENING, CLOSING, INFO-REQUEST, SOLICITATION-REQ-CLARIF, STATEMENT, GENERIC-ANSWER, AGREE-ACCEPT, REJECT, KIND-ATT-SMALLTALK; see Table 1.)</p>
        <p>
The task was open to everyone from industry and
academia. Sixteen participants registered, but only
two teams actually submitted the results for the
evaluation. A short description of each system
follows:
UNITOR - The system described in
          <xref ref-type="bibr" rid="ref2 ref5 ref7">(Croce and
Basili, 2018)</xref>
          is a supervised system which
relies on a Structured Kernel-based Support
Vector Machine for making the classification
of the dialogue turns sensitive to the
syntactic and semantic information of each
utterance. The Structured Kernel is a Smoothed
Partial Tree Kernel
          <xref ref-type="bibr" rid="ref6">(Croce et al., 2011)</xref>
          that
exploits both the parse tree and the cosine
similarity between the word vectors in a
distributional semantics model. The authors use
the tree parser provided by SpaCy (https://spacy.io/) and the
KeLP framework (http://www.kelp-ml.org/), a Java
kernel-based learning platform, for SVM.
        </p>
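The structured kernel itself is beyond the scope of this overview. As a minimal illustration of its smoothing ingredient, the cosine similarity between word vectors lets near-synonymous lexical nodes contribute a partial match when parse trees are compared (the vectors below are toy values, not the authors' distributional model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors: the lexical
    'smoothing' used when aligning tree nodes in a smoothed kernel."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy distributional vectors (hypothetical values for illustration):
vectors = {
    "mangiare": [0.9, 0.1, 0.3],   # "to eat"
    "cenare":   [0.8, 0.2, 0.4],   # "to dine"
    "parlare":  [0.1, 0.9, 0.2],   # "to talk"
}

# Near-synonyms score high (but below an exact match), so two turns
# phrased with different food-related verbs can still align:
sim_eat_dine = cosine(vectors["mangiare"], vectors["cenare"])
sim_eat_talk = cosine(vectors["mangiare"], vectors["parlare"])
```

In a hard (unsmoothed) tree kernel, "mangiare" and "cenare" would contribute nothing to each other's match score; the smoothing replaces that zero with their vector similarity.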
        <p>X2Check - The team did not submit a report.</p>
        <p>
          The performance of the participating systems is
evaluated based on the macro- (and micro-)
precision and recall
          <xref ref-type="bibr" rid="ref13">(Sebastiani, 2002)</xref>
          . However, the
official task measure used to rank the systems is
the macro-F. Results are reported in Table 4.
        </p>
        <p>The best performance (0.6531) is achieved by
the UNITOR system. Both systems also
outperform the baseline in terms of micro-F. The
baseline has a low macro-F since it always
predicts the same class (STATEMENT), so its
F-measure on every other class is zero. As
expected, the micro-F exceeds the macro-F, since
some classes are hard to predict due to the low
number of examples in the training data, such
as AGREE-ACCEPT, SOLICITATION-REQ-CLARIF, and
REJECT. Precision, recall, and F-score values by
speech act label are shown in Table 5.</p>
        <p>
          We also provide the confusion matrix for each
system, respectively Table 6 for UNITOR and
Table 7 for X2Check. We observe that, for both
systems, the class REJECT is the most difficult
to classify. This evidence is consistent with the
findings from previous research on the same
corpus of dialogues
          <xref ref-type="bibr" rid="ref11 ref6">(Novielli and Strapparava, 2011)</xref>
          .
In particular, we observe that dialogue moves
belonging to the REJECT class are often
misclassified as STATEMENT. More generally, the main
cause of error is misclassification as
STATEMENT. One possible reason is that statements
represent the majority class, thus inducing a bias in
the classifiers. Another possible explanation is
that dialogue moves that appear to be linguistically
consistent with the typical structure of statements
have been annotated differently, according to the
actual communicative role they play.
        </p>
        <p>
          We presented the first edition of the new shared
task on itaLIan Speech acT labEliNg
(iLISTEN) at EVALITA 2018. The task fits in the
fast-growing research trend focusing on conversational
access to information, e.g. using chatbots or
conversational agents. The task consists in
automatically annotating dialogue turns with speech
act labels representing the communicative
intention of the speaker. The corpus of dialogues was
collected in the scope of previous research on
natural language interaction with embodied
conversational agents. Specifically, the participating
systems had to annotate the speech acts associated
with the user dialogue moves, while the agent’s
dialogue turns were provided as context.
        </p>
        <p>
          We received two runs from two teams, one from
academia and the other from industry. In
spite of the inherent complexity of the task, both
systems largely outperformed the baseline,
represented by the trivial classifier that always predicts
the majority class for users’ moves. The best
performing system leverages syntactic features and
relies on a Structured Kernel-based Support
Vector Machine. Follow-up editions might involve
extending the benchmark with dialogues from
different domains. Similarly, dialogues in different
languages might also be included in the gold
standard, as done for the Automatic Misogyny
Identification task at EVALITA 2018
          <xref ref-type="bibr" rid="ref7">(Fersini et al., 2018)</xref>
          .
This would enable assessing to what extent the task
is inherently dependent on the language.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>John L.</given-names>
            <surname>Austin</surname>
          </string-name>
          .
          <year>1962</year>
          .
          <article-title>How to do things with words</article-title>
          .
          <source>William James Lectures</source>
          . Oxford University Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Tommaso</given-names>
            <surname>Caselli</surname>
          </string-name>
          , Nicole Novielli, Viviana Patti, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>EVALITA 2018: Overview of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          . In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors,
          <source>Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          ), Turin, Italy. CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Giuseppe</given-names>
            <surname>Clarizio</surname>
          </string-name>
          , Irene Mazzotta, Nicole Novielli, and Fiorella De Rosis.
          <year>2006</year>
          .
          <article-title>Social attitude towards a conversational character</article-title>
          . pages
          <fpage>2</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Mark G.</given-names>
            <surname>Core</surname>
          </string-name>
          and
          <string-name>
            <given-names>James F.</given-names>
            <surname>Allen</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Coding Dialogs with the DAMSL Annotation Scheme</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Danilo</given-names>
            <surname>Croce</surname>
          </string-name>
          and
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Basili</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A Markovian Kernel-based Approach for itaLIan Speech acT labEliNg</article-title>
          . In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors,
          <source>Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18)</source>
          , Turin, Italy. CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Danilo</given-names>
            <surname>Croce</surname>
          </string-name>
          , Alessandro Moschitti, and
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Basili</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Structured lexical similarity via convolution kernels on dependency trees</article-title>
          .
          <source>In Proceedings of EMNLP.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Elisabetta</given-names>
            <surname>Fersini</surname>
          </string-name>
          , Debora Nozza, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the Evalita 2018 Task on Automatic Misogyny Identification (AMI)</article-title>
          .
          <source>In Tommaso Caselli</source>
          , Nicole Novielli, Viviana Patti, and Paolo Rosso, editors,
          <source>Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18)</source>
          , Turin, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Rohan</given-names>
            <surname>Kar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Rishin</given-names>
            <surname>Haldar</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Applying Chatbots to the Internet of Things: Opportunities and Architectural Elements</article-title>
          . CoRR, abs/1611.03799.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Tina</given-names>
            <surname>Klüwer</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>“I Like Your Shirt” - Dialogue Acts for Enabling Social Talk in Conversational Agents</article-title>
          .
          <source>In Intelligent Virtual Agents</source>
          , pages
          <fpage>14</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Michael</given-names>
            <surname>McTear</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zoraida</given-names>
            <surname>Callejas</surname>
          </string-name>
          , and David Griol Barres.
          <year>2016</year>
          . The Conversational Interface: Talking to Smart Devices. Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Nicole</given-names>
            <surname>Novielli</surname>
          </string-name>
          and
          <string-name>
            <given-names>Carlo</given-names>
            <surname>Strapparava</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Dialogue act classification exploiting lexical semantics</article-title>
          .
          <source>In Conversational Agents and Natural Language Interaction: Techniques and Effective Practices, chapter 4</source>
          , pages
          <fpage>80</fpage>
          -
          <lpage>106</lpage>
          . IGI Global.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>John R.</given-names>
            <surname>Searle</surname>
          </string-name>
          .
          <year>1969</year>
          .
          <article-title>Speech Acts: An Essay in the Philosophy of Language</article-title>
          . Cambridge University Press, Cambridge, London.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Fabrizio</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Machine learning in automated text categorization</article-title>
          .
          <source>ACM computing surveys (CSUR)</source>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Stolcke</surname>
          </string-name>
          , Noah Coccaro, Rebecca Bates, Paul Taylor, Carol Van Ess-Dykema, Klaus Ries, Elizabeth Shriberg, Daniel Jurafsky, Rachel Martin, and
          <string-name>
            <given-names>Marie</given-names>
            <surname>Meteer</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech</article-title>
          . Comput. Linguist.,
          <volume>26</volume>
          (
          <issue>3</issue>
          ):
          <fpage>339</fpage>
          -
          <lpage>373</lpage>
          , September.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>David R.</given-names>
            <surname>Traum</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>20 Questions for Dialogue Act Taxonomies</article-title>
          .
          <source>Journal of Semantics</source>
          ,
          <volume>17</volume>
          (
          <issue>1</issue>
          ):
          <fpage>7</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Soroush</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Deb</given-names>
            <surname>Roy</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A Semiautomatic Method for Efficient Detection of Stories on Social Media</article-title>
          .
          <source>In Proc. of the 10th AAAI Conf. on Weblogs and Social Media, ICWSM 2016</source>
          , pages
          <fpage>711</fpage>
          -
          <lpage>714</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>