<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Last Utterance Proactivity Prediction in Task-oriented Dialogues</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sofia Brenna</string-name>
          <email>sbrenna@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernardo Magnini</string-name>
          <email>magnini@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Via Sommarive 18, Povo, Trento, 38123</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Free University of Bozen-Bolzano</institution>
          ,
          <addr-line>3 Dominikanerplatz 3 - Piazza Domenicani 3, Bozen-Bolzano, 39100</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>While current LLMs achieve excellent performance in information-seeking tasks, their conversational abilities when participants need to collaborate to jointly achieve a communicative goal (e.g., booking a restaurant, fixing an appointment, etc.) are still far from those exhibited by humans. Among the various collaborative strategies, in this paper we focus on proactivity, i.e., when a participant offers useful information that was not explicitly requested. We propose a new task, called last utterance proactivity prediction, aimed at assessing the capacity of an LLM to detect proactive utterances in a dialogue. In the task, a model is given a small portion of a dialogue (that is, a dialogue snippet) and asked to determine whether the last utterance of the snippet is proactive or not. There are several benefits to using dialogue snippets: (i) they are more manageable than full dialogues, allowing us to reduce complexity; (ii) several phenomena in dialogue, including proactivity, depend on a short context, which allows a model to learn from snippets rather than full dialogues; and (iii) dialogue snippets make it easier to experiment on balanced datasets, overcoming the skewed distribution of proactivity in whole dialogues. In the paper, we first introduce a dataset for the last utterance proactivity prediction task. The dataset is then used to instruct an LLM to classify proactivity. We run a series of experiments showing that predicting proactive utterances in a dialogue is feasible in a few-shot configuration, opening the road towards models that are able to generate proactive utterances as humans do.</p>
      </abstract>
      <kwd-group>
        <kwd>task-oriented dialogues</kwd>
        <kwd>pragmatics</kwd>
        <kwd>proactivity</kwd>
        <kwd>automated annotation</kwd>
        <kwd>large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In order to model proactivity, we follow an approach similar to that of Shaikh et al. [8], which focuses on
grounding acts, a class of collaborative behaviors investigated in dialogue pragmatics. The main idea
is that grounding acts can be (i) identified and annotated by a Large Language Model, and (ii) modeled
through appropriate fine-tuning of the model itself. In addition, our work is related to recent approaches
that use large language models as annotators [9], [10], [11]. In the long term, our research goal is to
instruct LLMs to be as proactive as humans are.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Last Utterance Proactivity Prediction</title>
      <p>The goal of the paper is to show the feasibility of automatic detection of proactive utterances in
task-oriented dialogues. We propose a task, Last Utterance Proactivity Prediction, where a portion of a
dialogue (i.e., a dialogue snippet) is given to a model, which has to predict whether the last utterance
of the snippet is proactive or not proactive. Using dialogue snippets instead of full dialogues brings
several benefits: (i) dialogue snippets are much more manageable than full dialogues, allowing us to
reduce the complexity of understanding and annotation; (ii) several phenomena in dialogue, including
proactivity, depend on a short context, which allows a model to learn from snippets rather than full
dialogues; and (iii) dialogue snippets make it easier to experiment on balanced datasets, overcoming
the skewed distribution of proactivity in whole dialogues (it has been estimated that about 85% of the
utterances in a task-oriented dialogue are not proactive).</p>
      <p>We started with D-PRO, a corpus of manually annotated task-oriented dialogues, which includes
151 dialogues from different sources, amounting to 2,855 turns and over 6,000 utterances, and carried
out the following steps:
• we transformed the whole-dialogue annotation task into a single-utterance annotation task: given
a short dialogue context, the model needs to establish whether the final utterance is either
'proactive' or 'not_proactive';
• in order to shorten the provided dialogue context, we collected excerpts (snippets) of 4 conversational
turns from each dialogue. We consider 4 turns a convenient context for proactivity annotation,
since statistics in the D-PRO corpus on turn adjacency between proactive utterances and the
turns that trigger them revealed that, on average, 77.7% of proactive utterances are a direct
response to the previous turn's utterances;
• to restore balance among labels, we chose the same number of snippets that ended without
proactivity as snippets that ended with proactivity.</p>
      <p>A relevant consequence of reducing the provided dialogue context is a significant reduction
of the input prompt length for an LLM, and therefore of the computational cost.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setting</title>
      <p>This section reports the main features of the setting we used to experiment with the last utterance
proactivity prediction task introduced in Section 2.</p>
      <p>
        Dataset for the experiments. The dataset for the experiments has been derived from D-PRO,
a corpus equipped with manually curated proactivity-oriented annotations. D-PRO comprises 151
dialogues from 5 task-oriented dialogue sub-corpora, namely, Italian Whatsapp Corpus ([12]), the Italian
Nespole! Corpus ([13, 14]), Jilda ([
        <xref ref-type="bibr" rid="ref7">7, 15</xref>
        ]), the Italian Ubuntu Chat Corpus ([16]), and Multiwoz 2.2 ([17]).
Most of the dialogues are in Italian, the exceptions being the Multiwoz 2.2 dialogues and some
dialogues from the Italian Whatsapp Corpus, due to code mixing and code switching employed by the
speakers. D-PRO proactivity annotations are performed at the utterance level.
      </p>
      <p>The composition of our experimental dataset is as follows: from D-PRO we gathered as many 4-turn
proactive dialogue snippets as there were proactive utterances, so that each snippet ended with a
different proactive utterance. Then, we extracted as many non-proactive snippets as there were
non-proactive utterances, so that each snippet ended with a different non-proactive utterance. Finally, we
randomly selected as many non-proactive dialogue snippets as proactive ones to restore
balance between the two types of snippets.</p>
      <p>Data splitting. From each of the 5 corpora in D-PRO we randomly selected 30 dialogue snippets as a
train set (to be used as few-shot examples), 50 snippets as a validation set (to be used for parameter
optimization), and 100 snippets as a test set. The set sizes were determined by the capacity of the least
proactive sub-corpus, MultiWOZ, which featured 90 proactive utterances, hence 90 proactive snippets
and 90 non-proactive snippets.</p>
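      <p>The per-corpus split can be sketched as follows, under the assumption that snippets are stored in a dictionary keyed by corpus name (the container layout is our own illustration):</p>

```python
import random

def split_corpus_snippets(snippets_by_corpus, seed=0):
    """Per-corpus random split into 30 train / 50 dev / 100 test snippets,
    mirroring the set sizes used in the paper."""
    rng = random.Random(seed)
    splits = {}
    for corpus, snippets in snippets_by_corpus.items():
        pool = snippets[:]           # copy so the original order is kept
        rng.shuffle(pool)
        splits[corpus] = {
            "train": pool[:30],      # few-shot examples
            "dev": pool[30:80],      # parameter optimization
            "test": pool[80:180],    # final evaluation
        }
    return splits
```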
      <p>Model. To choose the best model for proactivity prediction, we carried out a number of
experiments, reported in Section 4.1. The model selected is OpenAI's GPT-4o-2024-08-06, used with
temperature = 0.</p>
      <p>Prompt optimisation. A prompt engineering phase took place, in which various prompt proposals
were tested in the same setting (the same train snippets used as few-shot examples, the same validation
snippets used to evaluate the model). Figure 1 shows the final prompt used in our experiments. The
prompt consists of two main parts: (i) the system prompt, which contains the general task instructions
given to the model; (ii) the messages prompt, which is further divided into alternating user messages
and assistant messages: this is the part of the prompt where the model receives few-shot examples (user
messages) with answers (assistant messages). The final user/assistant pair contains the target dialogue
being evaluated by the model at the current time.</p>
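      <p>The two-part prompt layout can be sketched with the OpenAI chat message format (the instruction wording below is a placeholder, not the actual prompt of Figure 1):</p>

```python
def build_prompt(system_instructions, few_shot_pairs, target_snippet):
    """Assemble the system prompt plus alternating user/assistant few-shot
    messages, ending with the target snippet to be classified."""
    messages = [{"role": "system", "content": system_instructions}]
    for snippet, label in few_shot_pairs:
        messages.append({"role": "user", "content": snippet})
        messages.append({"role": "assistant", "content": label})
    # Final user message: the dialogue snippet being evaluated now
    messages.append({"role": "user", "content": target_snippet})
    return messages

# The model call (requires the `openai` package and an API key) would be
# along the lines of:
#   client.chat.completions.create(model="gpt-4o-2024-08-06",
#                                  messages=messages, temperature=0)
```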
      <p>Baseline. A random-chance baseline (accuracy = 0.5, see Table 8) is created by removing everything
from the system and message prompts except for the instruction 'Output exclusively either "proactive"
or "not_proactive".' and providing the target dialogue snippet.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Parameter Setting</title>
      <p>We first present several optimization trials performed on a single corpus (MultiWoz) in order to select
the best LLM (4.1) and to assess the impact of the number (4.2) and of the order (4.3) of few-shot example
snippets. Secondly, we ran tests on the DEV set of each of the five corpora in order to select the best
few-shot snippet order for each corpus (4.4).</p>
      <sec id="sec-4-1">
        <title>4.1. Setting the Large Language Model</title>
        <p>Once the best prompt had been established, we tested the APIs of various models to pick the best
cost/performance trade-off, reported in Table 1. GPT-4o-2024-08-06 was selected as the best performing
model (Accuracy: 0.74, F1: 0.68), with lower fees than GPT-4o-2024-05-13 and better scores than both
GPT-4o-mini and GPT-4o-mini-2024-07-18.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Setting the Number of Few-shot Dialogue Snippets</title>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Assessing the Impact of Few-shot Dialogue Snippets Order</title>
        <p>This experiment assesses the stability of the model when the order of the few-shot examples changes.
As the literature points out [18, 19, 20], LLMs suffer from a position bias when handling a longer context: we
found that this is also the case in our experiments, with a difference in accuracy of up to 10 points
when testing with different orders of the same set of examples. Table 3 reports the results of
the experiments under six random orderings of 12 examples.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Setting Few-Shot Snippets Order</title>
        <p>In this experiment we select the best order of the few-shot snippets for each of the five dialogue datasets
(i.e., Whatsapp, Nespole, Ubuntu, Jilda and MultiWoz) for the last utterance proactivity prediction task.
We tested 5 different orders of the dialogue snippets by randomly shuffling the same set of snippets selected
in Section 4.2. We used the following configuration for the experiments: 12 random few-shot snippets;
50 validation (DEV) snippets; 5 random shuffles of the few-shot snippets; average of the performances
of the 5 shuffles for each corpus. The optimal order is the one with the highest average
accuracy and F1.</p>
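        <p>The order-selection loop can be sketched as follows. The `evaluate` callable is a stand-in for the actual GPT-4o evaluation on the DEV set, and the lexicographic (accuracy, F1) comparison is our own simplification of the selection criterion:</p>

```python
import random

def best_shuffle_order(few_shot, dev_set, evaluate, n_shuffles=5, seed=0):
    """Try several random orderings of the same few-shot snippets on the
    DEV set and keep the one with the best (accuracy, F1) pair.

    `evaluate(order, dev_set)` is assumed to return (accuracy, f1)."""
    rng = random.Random(seed)
    best_order, best_score = None, (-1.0, -1.0)
    for _ in range(n_shuffles):
        order = few_shot[:]
        rng.shuffle(order)
        score = evaluate(order, dev_set)
        if score > best_score:   # lexicographic: accuracy first, then F1
            best_order, best_score = order, score
    return best_order, best_score
```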
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>In this section we present the results of the last utterance proactivity prediction task in two different
configurations: using few-shot examples from individual corpora (5.1), and mixing few-shot examples from all corpora
to evaluate the model's stability in corrupted-context scenarios.</p>
      <sec id="sec-5-1">
        <title>5.1. Few-Shots from Individual Corpus and Testing on Individual Corpus</title>
        <p>We tested the model using the prompt and few-shot configuration that gave the best results on the
development set for each corpus individually. As Table 5 shows, while inter-annotator agreement (IAA) with the
ground-truth labels is still not optimal, and scores rather low on both Whatsapp and Ubuntu (fair
agreement, according to Landis and Koch's scale [21]), we reach moderate agreement on Jilda and
Multiwoz, and substantial agreement on Nespole (0.72), just below the IAA score between human
annotators (0.77). Results are consistent with the outcomes we obtained on the development set in
Table 4. On average, Nespole achieved the best accuracy (0.86), followed by MultiWoz (0.77), Jilda
(0.75), and Whatsapp and Ubuntu (both 0.64). For all datasets the results are largely above the baseline
(i.e., 0.50 accuracy, equivalent to chance; see also Table 8, Baselines, Full Context column), showing
that the model has correctly learned our definition of proactivity. The fact that Nespole obtained the
best results is somewhat surprising, given that this corpus is quite complex: utterances are longer than
in the other corpora, and so are the dependencies between a proactive utterance and its trigger
utterance. Longer utterances, on the other hand, mean that the model is given a slightly richer context
on which to base its judgments, which may help with the annotation process. As for the lowest scores,
the poor results for Whatsapp and especially Ubuntu were expected, since these are less structured
(both syntactically and grammatically), more chaotic, multi-party dialogue corpora, where proactivity
is much more difficult to detect unanimously even for humans, and where the human-human IAA is
lowest (0.63 and 0.41, respectively).</p>
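        <p>The agreement scores discussed above follow the usual chance-corrected formulation. A minimal sketch of Cohen's kappa, the statistic behind the Landis and Koch scale (here applied to model labels vs. gold labels):</p>

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two label sequences over the same items,
    e.g. model predictions vs. human gold labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the two annotators match
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of per-label marginal frequencies
    labels = set(labels_a) | set(labels_b)
    chance = sum((labels_a.count(l) / n) * (labels_b.count(l) / n)
                 for l in labels)
    return (observed - chance) / (1 - chance)
```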
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Few-Shots from All Corpora and Testing on Individual Corpus</title>
        <p>Secondly, we ran some in-context learning [22] experiments to check whether few-shot examples from
different corpora could improve the performance on an individual target corpus. The idea is drawn
from work on multi-task learning [23], where more than one task is learned simultaneously by the
model, and transfer learning, where improvement is obtained on a new task through the transfer of
knowledge from a related task that has already been learned [24, 25]. Our intuition is that example
variety on very similar tasks may be the key to improvement on a single target task. Following this line,
we combine dialogue snippets from all five corpora as few-shot examples, so that we have 5 sets
of 12 snippets each. Given the position bias exhibited by the model, we decided to keep the intra-corpus
example order the same as the one used in 5.1, and to randomly shuffle the inter-corpus order 5 times
to select the optimal few-shot prompt (the same methodology as in 4.4 and 5.1).</p>
        <p>The outcomes of the tests with the optimal prompt are reported in Table 6, which is directly comparable
to Table 5. We found that the only corpus in our experiments that suffered from the mixed few-shot
prompting is Jilda, with a significant drop in performance, while every other corpus has very similar
or slightly higher scores. Ablation tests on the Jilda corpus led to an accuracy of 0.71 when removing
the most chaotic corpora (Ubuntu and Whatsapp), showing that the individual-corpus few-shot
approach still works best for this corpus.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Few-Shot from All Corpora, Testing on All Corpora</title>
        <p>We finally tested on the cumulative test set of all corpora, with mixed few-shot examples. We
experimented with two configurations of few-shot snippets: (i) 60 snippets in the optimal order as in Table 6;
(ii) 15 snippets in total, with 3 random snippets per corpus. Outcomes in Table 7 show the best and average
results over 5 runs for both the 60 and the 15 few-shot example settings.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Testing with Corrupted Context: Masking the Trigger Utterance</title>
        <p>We investigated the performance of the LLM in a corrupted-context situation. By definition,
the two key characteristics of proactivity are not being solicited and being beneficial to the dialogue
goals, hence proactivity can only be defined in terms of the previous context. When we corrupt
the context before the snippet's final utterance, we may be compromising the data required for the
proactivity annotation task. We implemented two corrupted-context situations: triggering utterance
removed, where the text of the utterance that triggers the final utterance in the dialogue snippet is
removed from the snippet, and triggering utterance masked, where the text of the trigger utterance is
masked by a placeholder. In both cases, the presence of a corrupted utterance is indicated
by the utterance number, while only the content is erased or masked. Since we need to test the
effect of the context corruption, the model still learns the original full-context task from the few-shot
examples. We anticipate that the LLM's performance will suffer, since a critical component of the
dialogue (the triggering utterance) has been compromised. We also expect the number of false positives
to increase significantly. This is because, by eliminating the trigger utterance, we may
also eliminate the element that renders the final utterance either proactive or non-proactive. Specifically,
we are deleting the element of the context that allows us to determine whether the content of the last
utterance is novel (i.e., proactive) or unrequested by the trigger. Since the request is missing from the
corrupted setting, the triggered response appears proactive rather than solicited, resulting in an
increase in "proactive" labels.</p>
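        <p>The two corruption modes can be sketched as follows, under an assumed snippet encoding of (utterance_number, text) pairs (the placeholder token is also our own choice):</p>

```python
def corrupt_snippet(snippet, trigger_idx, mode="masked", placeholder="[MASKED]"):
    """Return a copy of the snippet with the trigger utterance's text
    removed or masked; the utterance number is kept, so the presence of
    the corrupted utterance stays visible to the model."""
    corrupted = []
    for idx, (num, text) in enumerate(snippet):
        if idx == trigger_idx:
            text = "" if mode == "removed" else placeholder
        corrupted.append((num, text))
    return corrupted
```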
        <p>Results, presented in Table 8, confirm our intuition, showing that the model accuracy drops from 0.80
in the full context to 0.66 and 0.64 when the triggering utterance is removed and masked, respectively.
The majority of the performance drop is attributable to an increase in false positives, from 2 in the full
context to 8 in the corrupted context, as well as a drop in true negatives, which supports our hypothesis.
On the other hand, our experiments show that even with insufficiently task-specific few-shot examples,
the model can perform significantly better than the random-chance baselines (see also Section 3) thanks to a
solid instruction prompt.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>We introduced a new task, namely, last utterance proactivity prediction, aiming at assessing the capacity
of Large Language Models to detect and annotate proactive behaviours in task-oriented dialogues.
The task allows us to shorten the context from a whole dialogue to a dialogue snippet, simplify the
annotation process, and balance the dataset for positive and negative labels. We showed that a few-shot
approach with GPT-4o achieves encouraging performance on a test set composed of dialogue snippets
collected from five different corpora, and that in particular for the Nespole corpus the agreement
between the model labels and the human-annotated gold labels is nearly equivalent to the agreement
between humans.</p>
      <p>As for future work, there are several ongoing activities. First, we are still investigating techniques to
further improve performance on the task, especially when testing on the combined dialogues from all
corpora. Then, we plan to use the GPT-4o model to automatically annotate a large amount (about
100K) of dialogue snippets, in order to create a training corpus which, in turn, will be used to instruct
an open-source model (e.g., Llama 3 8B) to detect proactivity.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] G. Bonetta, C. D. Hromei, L. Siciliani, M. A. Stranisci, Preface to the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024), in: Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024), co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024), 2024.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] V. Balaraman, B. Magnini, Proactive systems and influenceable users: Simulating proactivity in task-oriented dialogues, in: Proceedings of the 24th Workshop on the Semantics and Pragmatics of Dialogue - Full Papers, virtually at Brandeis, Waltham, New Jersey, SEMDIAL, 2020.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] V. Balaraman, B. Magnini, Pro-active systems and influenceable users: Simulating pro-activity in task-oriented dialogues, Proceedings of the 24th Workshop on the Semantics and Pragmatics of Dialogue (2020).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] P.-M. Strauss, W. Minker, Proactive spoken dialogue interaction in multi-party environments, Springer, 2010.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Y. Deng, W. Lei, W. Lam, T.-S. Chua, A survey on proactive dialogue systems: Problems, methods, and prospects, arXiv preprint arXiv:2305.02750 (2023).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] S. C. Levinson, Pragmatics, Cambridge University Press, Cambridge, United Kingdom, 1983.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] I. Sucameli, A. Lenci, B. Magnini, M. Simi, M. Speranza, Becoming JILDA, in: Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020, CEUR-WS, Bologna, 2020.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] O. Shaikh, K. Gligorić, A. Khetan, M. Gerstgrasser, D. Yang, D. Jurafsky, Grounding gaps in language model generations, in: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 6279–6296.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] T. Labruna, S. Brenna, A. Zaninello, B. Magnini, Unraveling ChatGPT: A critical analysis of AI-generated goal-oriented dialogues and annotations, arXiv preprint arXiv:2305.14556 (2023).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] B. Ding, C. Qin, L. Liu, L. Bing, S. Joty, B. Li, Is GPT-3 a good data annotator?, 2022. URL: https://arxiv.org/abs/2212.10450. doi:10.48550/ARXIV.2212.10450.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] F. Huang, H. Kwak, J. An, Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech, ArXiv abs/2302.07736 (2023).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] F. Hewett, Sequential Organisation in WhatsApp Conversations, unpublished bachelor's thesis, Freie Universität Berlin, summer semester, 2017.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] S. Burger, L. Besacier, P. Coletti, F. Metze, C. Morel, The NESPOLE! VoIP dialogue database, in: Seventh European Conference on Speech Communication and Technology, 2001.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] N. Mana, S. Burger, R. Cattoni, L. Besacier, V. MacLaren, J. McDonough, F. Metze, The NESPOLE! VoIP multilingual corpora in tourism and medical domains, in: Eighth European Conference on Speech Communication and Technology, 2003.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] I. Sucameli, A. Lenci, B. Magnini, M. Speranza, M. Simi, Toward data-driven collaborative dialogue systems: The JILDA dataset, Italian Journal of Computational Linguistics (2021).</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] R. Lowe, N. Pow, I. V. Serban, J. Pineau, The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems, in: Proceedings of the SIGDIAL 2015 Conference, 2015, pp. 285–294.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] X. Zang, A. Rastogi, S. Sunkara, R. Gupta, J. Zhang, J. Chen, MultiWOZ 2.2: A dialogue dataset with additional annotation corrections and state tracking baselines, arXiv preprint arXiv:2007.12720 (2020).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] C. Zheng, H. Zhou, F. Meng, J. Zhou, M. Huang, Large language models are not robust multiple choice selectors, in: The Twelfth International Conference on Learning Representations, 2023.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] X. Chen, R. A. Chi, X. Wang, D. Zhou, Premise order matters in reasoning with large language models, arXiv preprint arXiv:2402.08939 (2024).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, P. Liang, Lost in the middle: How language models use long contexts, Transactions of the Association for Computational Linguistics 12 (2024) 157–173.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics (1977).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, Z. Sui, A survey on in-context learning, arXiv preprint arXiv:2301.00234 (2022).</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] Y. Zhang, Q. Yang, An overview of multi-task learning, National Science Review 5 (2018) 30–43.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] K. Weiss, T. M. Khoshgoftaar, D. Wang, A survey of transfer learning, Journal of Big Data 3 (2016) 1–40.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] L. Torrey, J. Shavlik, Transfer learning, in: Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, IGI Global, 2010, pp. 242–264.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>