<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the Assessing DIScourse COherence in Italian TEXts (DisCoTEX) task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dominique Brunato</string-name>
          <email>dominique.brunato@ilc.cnr.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Colla</string-name>
          <email>davide.colla@unito.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irene Dini</string-name>
          <email>irene.dini@ilc.cnr.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Paolo Radicioni</string-name>
          <email>daniele.radicioni@unito.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Amelio Ravelli</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>The Assessing DIScourse COherence in Italian TEXts (DisCoTEX) task is the first shared task focused on modelling discourse coherence for Italian real-world texts, and it has been proposed for the first time at EVALITA 2023. Providing two different datasets from different textual genres, we arranged the task into two independent sub-tasks: a more traditional one, aimed at evaluating whether models are able to distinguish well-organized documents from corrupted ones, and a less explored one, which assesses the models' performance on texts evaluated for coherence by human raters. In this paper, we describe the datasets released, discuss the different approaches tackled by the participating systems, and provide a first analysis of the obtained results.</p>
      </abstract>
      <kwd-group>
        <kwd>text coherence</kwd>
        <kwd>Italian language</kwd>
        <kwd>computational modeling</kwd>
        <kwd>evaluation campaign</kwd>
        <kwd>dataset</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>Coherence is a key property of any well-organized text and it plays a crucial role in human discourse processing. Indeed, as individuals process unfolding text, they are required to assemble information from single sentences and to draw inferences between and among them in order to create a meaningful mental representation of the whole text. According to the tripartite model developed by [1], this is the outcome of a three-step process in which readers construct multileveled memory representations of a text, encoding different, and progressively more abstract, information at each level. From this perspective, coherence is an inherently psychological construct, thus very hard to model; however, it also has a counterpart at the level of linguistic content and structure, often referred to as “cohesion”, a property of a text that is conveyed by signalling linguistic devices such as reference, ellipsis, discourse connectives, and argument overlap, which help readers make explicit the logical links between different units in texts.</p>
      <p>As regards computational modelling, coherence has been widely investigated in Natural Language Processing, often taking inspiration from frameworks like the Centering Theory [2]. One popular approach in this context is the entity-grid approach, which focuses on assessing local coherence, specifically the transitions between adjacent sentences (see, among others, [3, 4]). More recently, neural models have also been applied to deal with both structured representations of text and unstructured text, taking advantage of their ability to learn useful representations for the task, e.g. [5, 6]. Modelling coherence in natural language is of pivotal importance in a variety of downstream applications, from automatic essay scoring in language learning scenarios [7, 8] to language assessment in clinical settings [9, 10]. Additionally, from the Natural Language Generation point of view, coherence is an intrinsic evaluation metric to assess the quality of generated texts. An emerging area of interest pertains to the interpretability of modern deep neural networks. In this respect, while existing work on probing pre-trained language models has largely focused on sentence-level properties, the ability of these models to encode discourse and pragmatic phenomena is still unclear [11, 12, 13].</p>
      <p>The shared task DisCoTEX, organized in the context of the 8th evaluation campaign of NLP and speech tools for the Italian language (EVALITA 2023) [14], intends to encourage research on automatic discourse coherence modeling with emphasis on the Italian language.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Definition of the Task</title>
      <p>Drawing inspiration from existing coherence modeling literature, the DisCoTEX task was designed with the intention of addressing two distinct scenarios. The first scenario involves the evaluation of models’ ability to differentiate well-structured documents from corrupted ones. The corrupted documents are typically created by either shuffling the sentence order of the original document or by replacing specific linguistic elements that contribute to coherence within and across sentences, such as personal pronouns or discourse connectives. The second scenario, which has been less explored, focuses on assessing the models’ performance in coherence evaluation by comparing their predictions to human raters’ evaluations.</p>
      <p>To capture these distinct scenarios, we proposed two independent sub-tasks:
• Sub-task 1 - Last sentence classification: This sub-task was cast as a binary classification task. Specifically, participants are presented with a prompt, which is a short paragraph consisting of approximately three consecutive sentences, and an individual sentence referred to as the target. The objective is to classify whether the target sentence, when combined with the prompt, forms a coherent or incoherent text. The negative target can either be a sentence randomly selected from a different document or a sentence extracted from the same document as the prompt, in order to introduce incremental degrees of complexity in the resolution of the task;
• Sub-task 2 - Human score prediction: This sub-task was framed as a regression task where participants were asked to predict the average coherence score assigned by human raters to short paragraphs. These paragraphs were evaluated in their original or artificially modified version.</p>
      <p>As shown in previous tasks on the automatic assessment of subjective phenomena [15, 16], this scenario is expected to be more challenging, as it requires modeling the human perception of coherence, which can be influenced by both linguistic and non-linguistic factors, as highlighted in previous studies [7].</p>
      <p>For both sub-tasks, datasets were extracted from two corpora representative of two distinct domains, as described in the following section.</p>
      <sec id="sec-2-1">
        <title>3. Datasets</title>
        <p>The dataset1 utilized for the DisCoTEX task encompasses texts sourced from two distinct origins: the Italian Wikipedia and the Italian speech transcripts section of the Multilingual TEDx corpus (mTEDx). These sources represent two different language varieties: the former is a ‘standard’ written variety, and the latter a ‘hybrid’ variety combining diverse genres (e.g., university lectures, newspaper articles, conference presentations, and TV science programs) as well as different semiotic modes, such as written, spoken, audio, and video [17]. Extensive research on genre and register variation acknowledges that written and spoken language employ distinct strategies to establish coherence within a text [18]. Therefore, we decided to evaluate systems on both these types of data.</p>
        <p>For sub-task 1, each data sample consists of a prompt, which is a paragraph comprising three sentences, followed by a target sentence. To create the written dataset, we leveraged the existing paragraph segmentation in Wikipedia to select four-sentence paragraphs. For the spoken dataset, as mTEDx speeches lacked such internal structure, we divided all the transcripts into passages of four sentences. The target sentence is determined as the immediate continuation of the prompt, forming a coherent sample. In the case of a non-coherent passage, as previously anticipated, we selected either a sentence randomly taken from a different document or the sentence that appears ten sentences after the prompt in the same document. Each final dataset consists of 8,000 training samples and 800 test samples. Examples can be found in Table 1.</p>
        <p>Regarding sub-task 2, the dataset construction differs slightly. In this case, for each source we extracted samples consisting solely of four-sentence paragraphs (we keep the term ‘prompts’ to refer to them), with half of them deliberately made incoherent through sentence perturbations. The possible perturbations, chosen with equal probability, include:
• Flip of two random sentences: each sentence of the prompt has the same probability of being flipped;
• Swap of a sentence with the 10th sentence following it in the document from which the prompt was extracted. The first and the last sentence have double the swap probability compared to the middle two sentences, to make the swapping of a first/last or a middle sentence equiprobable.</p>
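        <p>A minimal sketch of the two perturbation strategies (our assumptions: a prompt is a list of four sentence strings, the source document is a list of sentences, and the doubled edge probability is realized via weighted sampling; function names are illustrative):</p>
        <preformat>
```python
import random

def flip_two_sentences(prompt, rng=random):
    # Pick two distinct positions uniformly at random and swap the sentences.
    i, j = rng.sample(range(len(prompt)), 2)
    perturbed = list(prompt)
    perturbed[i], perturbed[j] = perturbed[j], perturbed[i]
    return perturbed

def swap_with_tenth_next(prompt, document, start, rng=random):
    # 'start' is the index in 'document' where the prompt begins.
    # First and last prompt sentences get double the selection weight.
    positions = list(range(len(prompt)))
    weights = [2 if pos in (0, len(prompt) - 1) else 1 for pos in positions]
    k = rng.choices(positions, weights=weights, k=1)[0]
    perturbed = list(prompt)
    # Replace the chosen sentence with the 10th sentence following it.
    perturbed[k] = document[start + k + 10]
    return perturbed
```
        </preformat>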
        <p>For the purposes of the DisCoTEX task, we selected 1,064 prompts equally balanced between the two domains.</p>
        <sec id="sec-2-1-2">
          <p>1: The DisCoTEX dataset is available at the following link: https://github.com/davidecolla/DisCoTex/</p>
          <p>Table 1: Examples from the sub-task 1 dataset (Prompt, Target, Class).</p>
          <p>Prompt: “Il regolamento del carcere era durissimo e le condizioni igieniche drammatiche. Agli ebrei erano negati i pochi diritti concessi agli altri prigionieri politici e comuni, ovvero l’ora d’aria in cortile, l’assistenza sanitaria, la possibilità di ricevere lettere e pacchi e di acquistare generi alimentari allo spaccio del carcere. Gli interrogatori degli arrestati erano condotti in uno stanzone a pian terreno, detto il ‘refettorio’.” Target: “Qui le sevizie di ogni genere venivano inflitte soprattutto sugli ebrei che non rivelavano i recapiti o i nascondigli dei loro parenti, della cui presenza a Milano o nei dintorni le SS erano venute a conoscenza tramite loro spie.” Class: 1</p>
          <p>Prompt: “Ci siamo trovati a Brasilia, la capitale del Brasile; e c’erano città di tutto il mondo, dall’Australia al Giappone, all’Asia, all’Africa, agli Stati Uniti. E lì abbiamo avuto la consapevolezza che siamo un movimento che sta crescendo nel mondo e che sempre più costruisce risultati e vantaggi. Una delle più grandi città del mondo che ha fatto questa scelta è San Francisco.” Target: “Vedete in questo semplicissimo grafico, il rosso è tutto quello che prima, una decina di anni fa, andava a smaltimento.” Class: 0</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <p>Of these, 33% (i.e. 360 prompts) are extracted from the subset of authentic prompts and 66% (i.e. 704) from perturbed ones. Examples can be found in Table 2.</p>
        <p>As anticipated, coherence was assessed through
manual annotation. Specifically, to gather human ratings
of coherence, we conducted a crowdsourcing task on
the Prolific 2 platform, involving Italian native speakers.</p>
        <p>Recognizing that coherence is a subjective concept influenced by the reader or listener's interpretation, we employed a gradual judgment approach, and asked annotators to evaluate their perception of coherence on a Likert scale ranging from 1 to 5. The number of annotations per prompt ranged from 9 to 12, with an average of 11.75.</p>
        <p>The resulting dataset was split into training and test
samples with a proportion of 80% to 20%, respectively.</p>
        <p>In Figure 1 we show some general statistics about the collected judgments, considering both the whole dataset of prompts and prompts grouped into specific subsets, according to genre and perturbation. As it can be seen, prompts derived from Wikipedia texts are generally rated as more coherent by humans compared to TEDx prompts.</p>
        <p>Figure 1: Overview of human judgments collected for the dataset used in sub-task 2. The plot shows the overall mean of human judgments for the whole dataset (all) and for the respective subsets, including both coherent (*_no_pert) and perturbed prompts (*_pert).</p>
        <p>This observation confirms previous findings with regard to the influence of genre on the perception of coherence [7]. What is particularly interesting is that this disparity is evident not only in the original form of the prompts but also in the perturbed versions. This seems to suggest that Wikipedia documents tend to exhibit a more standardized structure, including internal coherence, which remains relatively stable even with minor alterations that affect sentence order or the insertion of an intruder sentence from the same document. We plan to conduct a more in-depth analysis by examining each perturbation strategy independently, to gain a deeper understanding of their individual impact on coherence.</p>
        <sec id="sec-2-2-1">
          <title>3.1. Format</title>
          <p>The DisCoTEX dataset was released as tab-separated text files. Specifically, for sub-task 1, the two data sources (i.e. Wikipedia and TED) were kept separate and, for each source, participants were provided with a file with the following structure:</p>
          <p>• ID: a numerical identifier for the entry;
• PROMPT: a textual passage made of three consecutive sentences;
• TARGET: the sentence which participants are asked to assess as coherent with the prompt or not (i.e. whether it is the next sentence after the prompt);</p>
        </sec>
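        <p>A minimal reader for this format (a sketch; it assumes no header row and the column order ID, PROMPT, TARGET, CLASS, with CLASS being the gold label):</p>
        <preformat>
```python
import csv
import io

# Illustrative sample row in the sub-task 1 format: ID, PROMPT, TARGET, CLASS.
sample = "1\tFirst sentence. Second one. Third one.\tA candidate target.\t1\n"

def read_subtask1(stream):
    # Each row: numeric id, 3-sentence prompt, candidate target, gold class (1/0).
    rows = []
    for rec in csv.reader(stream, delimiter="\t"):
        rows.append({"id": int(rec[0]), "prompt": rec[1],
                     "target": rec[2], "label": int(rec[3])})
    return rows

data = read_subtask1(io.StringIO(sample))
```
        </preformat>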
      </sec>
      <sec id="sec-2-3">
        <p>2: https://www.prolific.co/</p>
        <p>“Le nuove idee sono una sfida che accende nel nostro cervello la stessa area che elabora le minacce fisiche. Ecco perché tendiamo a reagire con forza, a volte con aggressività, alle nuove idee. Davanti a informazioni che mettano in discussione le nostre convinzioni noi tendiamo paradossalmente a reagire rafforzandole ancora di più. Si chiamano bias cognitivi, sono molto forti e ci caschiamo tutti.” Mean: 4.83</p>
        <p>“I Romani furono scommettitori appassionati, specialmente ai tempi dell’Impero Romano, e il gioco dei dadi era popolare, seppur proibito da una ‘Lex alearia’ del 204 a.C. circa, eccetto che durante i Saturnali. Orazio derise la gioventù dell’epoca che sprecava tempo tra i pericoli del gioco invece di domare il suo cavallo e darsi alle durezze dell’inseguimento. Una di queste diceva che nessuna causa poteva essere intentata da una persona che permetteva il gioco d’azzardo nella sua casa anche se era stata imbrogliata o assalita. Le scommesse sui dadi per denaro fu l’oggetto di molte leggi Romane.” Mean: 3.3 (st. dev. 0.95)</p>
        <p>• ID: a simple identifier for the entry;
• TEXT: the 4-sentence prompt to be evaluated;</p>
      </sec>
      <sec id="sec-2-4">
        <p>• CLASS: the class to be predicted (1 if the target follows the prompt, 0 otherwise).</p>
        <p>For sub-task 2, we mixed data from the two sources and released a single dataset with the following structure:</p>
        <p>• MEAN: the coherence score of the text to be predicted, based on the mean of the human judgements collected.</p>
        <p>Table 2: Examples extracted from the dataset of sub-task 2. The first one is an original prompt taken from the mTEDx corpus. The second one is a perturbed prompt from the Wikipedia corpus, with a swap between the third and the last sentence.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Participants</title>
      <p>We received a total of 3 submissions for sub-task 1 and 2 submissions for sub-task 2, from 3 different teams.</p>
      <p>In the context of DisCoTEX, for both sub-tasks participants could leverage further external resources to enhance their models, with the exception of Wikipedia and mTEDx data.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation measures</title>
      <sec id="sec-4-1">
        <p>We defined the following evaluation metrics for each sub-task:
• For sub-task 1, the evaluation metric is Accuracy (the ratio between correctly predicted samples and all processed samples) obtained by each system on the test set. We also report Precision, Recall and F-score for the two classes;
• For sub-task 2, the evaluation metric is the harmonic mean of the Pearson and Spearman correlation coefficients between the participants' scores and the test set scores.</p>
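        <p>The sub-task 2 score can be sketched in pure Python directly from the definitions (scipy.stats.pearsonr and spearmanr would serve equally well; the function names here are ours):</p>
        <preformat>
```python
from statistics import mean

def pearson(xs, ys):
    # Pearson correlation from the definition.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(xs):
    # Average ranks, so ties are handled as in Spearman's rho.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i != len(xs):
        j = i
        while j + 1 != len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def official_score(pred, gold):
    # Harmonic mean of Pearson and Spearman correlations.
    p = pearson(pred, gold)
    s = pearson(ranks(pred), ranks(gold))
    return 2 * p * s / (p + s)
```
        </preformat>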
        <p>Baseline. The baseline for both tasks has been computed by employing one-hot vector representations. For sub-task 1, we extracted the one-hot vector for each sentence s_i in the input prompt P = {s_1, s_2, ..., s_n}, as well as for the target sentence t. The distance between the prompt P and the target sentence t is computed as the average Hamming distance between each sentence of the prompt and the target: d(P, t) = (1/n) · Σ_{i=1}^{n} d_H(s_i, t). To decide whether the target sentence t is coherent with the paragraph P, we computed the median distance value across the whole training dataset and used it as a threshold: all the test samples with a distance value under the median have been considered coherent, incoherent otherwise. For sub-task 2, we first extracted the one-hot vectors from each sentence s_i in the input prompt P: v_1 ← s_1, v_2 ← s_2, ..., v_n ← s_n. Then we computed the proximity between each consecutive vector pair ⟨v_i, v_{i+1}⟩ through the Jaccard distance metric, thereby obtaining (n−1) distance scores that capture the degree of semantic overlap between neighbouring sentences. The coherence score for the paragraph P is then the average over adjacent pairs: c(P) = (1/(n−1)) · Σ_{i=1}^{n−1} d_J(v_i, v_{i+1}).</p>
        <p>Each submission had the option to include up to three different runs. The strategies used to approach the task are all very different from each other. Teams that participated in both sub-tasks opted to use the same strategy for both challenges. None of the systems chose to utilize additional resources apart from the official datasets. Further information regarding the task participation can be found in Table 3.</p>
        <p>Table 3: Participating teams, affiliations, sub-tasks and number of runs. MPG (Sony Computer Science Laboratories Paris, France; Enrico Fermi Research Center (CREF), Rome, Italy; Sapienza University of Rome, Italy): 3 runs, sub-task 1 only. IUSSNets (IUSS Pavia, Italy): 3 runs, both sub-tasks. ExtremITA (Università degli Studi di Roma Tor Vergata, Italy; Università di Torino, Italy): 4 runs, both sub-tasks.</p>
        <p>Models per team in the leaderboards: extremita - LLaMA; IUSSNets - BERT; mpg - LGBM; baseline - Hamming.</p>
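        <p>A minimal sketch of this baseline (tokenization and vocabulary handling are not specified here, so whitespace tokens over a fixed vocabulary are our assumption):</p>
        <preformat>
```python
def one_hot(sentence, vocab):
    # Binary bag-of-words vector over a fixed vocabulary.
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocab]

def hamming(u, v):
    # Fraction of vocabulary positions on which the two vectors disagree.
    return sum(1 for a, b in zip(u, v) if a != b) / len(u)

def jaccard_distance(u, v):
    # 1 minus intersection-over-union of the active positions.
    inter = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    return 1 - inter / union if union else 0.0

def d_prompt_target(prompt_sents, target, vocab):
    # Sub-task 1: average Hamming distance between prompt sentences and target.
    t = one_hot(target, vocab)
    return sum(hamming(one_hot(s, vocab), t) for s in prompt_sents) / len(prompt_sents)

def coherence_score(prompt_sents, vocab):
    # Sub-task 2: average Jaccard distance over adjacent sentence pairs.
    vs = [one_hot(s, vocab) for s in prompt_sents]
    pairs = list(zip(vs, vs[1:]))
    return sum(jaccard_distance(u, v) for u, v in pairs) / len(pairs)
```
        </preformat>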
      </sec>
      <sec id="sec-4-2">
        <p>The MPG team [19] utilized the tree-based classifier LightGBM, incorporating a set of explicitly engineered features aimed at comparing the prompt and the target with respect to several metrics, such as TF-IDF vectors; counts of uppercase words, tenses, punctuation marks, words, and characters; as well as sentence embeddings extracted from Sentence-BERT [20]. They exclusively participated in sub-task 1 with two runs.</p>
        <p>The IUSSNets team [21] employed fine-tuning techniques on four distinct Italian language models: BERT-ita [22], Electra-ita [22], Umberto [23], and Bertino [24], separately for each sub-task. For sub-task 1, they submitted three BERT fine-tuned models: the first fine-tuned on Wikipedia (BERT 1), the second on mTEDx (BERT 2), and the third on both (BERT 3), achieving the second-place score. For sub-task 2, they submitted BERT, Bertino, and Electra fine-tuned models, once again securing the second position, primarily due to the performance of the Electra model.</p>
      </sec>
      <sec id="sec-4-3">
      <p>The ExtremITA team [25] competed using two multi-task Language Models. The first model (extremIT5) is an encoder-decoder based on iT5-small [26], while the second model (extremITLLaMA) is a decoder based on Camoscio [27], the Italian version of LLaMA [28]. These models largely differ in number of parameters: iT5-small has approximately 110 million parameters, while the used version of Camoscio has 7 billion parameters. Both models underwent joint fine-tuning on all EVALITA 2023 tasks and sub-tasks, leveraging prompting techniques. For both DisCoTEX sub-tasks, the extremIT5 model received each instance of the dataset preceded by the task and sub-task name, and it produced the predicted label or score as output. Conversely, the extremITLLaMA model, which requires a structured prompt, was provided with a textual description of the task and the desired output format specification. For sub-task 1 the prompt is: “Le due frasi precedenti, separate da ‘[SEP]’, sono coerenti tra loro? Rispondi sì o no” (“Are the two preceding sentences, separated by ‘[SEP]’, coherent with each other? Answer yes or no”); for sub-task 2 the prompt is: “Quanto è coerente questa frase in una scala da 0 a 5?” (“How coherent is this sentence on a scale from 0 to 5?”). The team emerged as the winner across both DisCoTEX sub-tasks and datasets, thanks to the LLaMA-based model. However, the iT5-based model performed considerably worse, especially in the second sub-task, where it remained below the baseline.</p>
      <sec id="sec-4-3-1">
        <title>6. Results</title>
        <p>Tables 4 and 5 report the leaderboards of the systems taking part in sub-task 1 and sub-task 2, respectively. Note that, for the purpose of the official ranking, for sub-task 1 we considered the accuracy of the best run, and we further computed the mean between the best result/run on Wikipedia and the best result/run on mTEDx data. Conversely, for sub-task 2 we first computed both the Pearson and Spearman correlations, then we applied the harmonic mean between the two measures.</p>
        <p>As can be seen, all systems outperform the baseline in both sub-tasks. The best performance was achieved by the team ExtremITA with the system based on the LLaMA model.</p>
      </sec>
      <sec id="sec-4-3-2">
        <title>7. Conclusion</title>
        <p>[...] coherence within text. The second one intended to model the human perception of text coherence by predicting the average score attributed by human raters to a text. A novel dataset was developed for this task, comprising texts from two different domains, representative of a written and a spoken language variety, in order to investigate the role of modality in the automatic modeling of coherence. Three teams participated in the task and submitted a total of 19 runs. Notably, the ExtremITA team secured the first position in both sub-tasks with their system based on the largest decoder model proposed. However, it is worth highlighting that smaller models with fewer parameters also demonstrated comparable performance, indicating their effectiveness in capturing discourse-related information. Quite surprisingly, the results of sub-task 2 revealed that systems were more proficient in predicting coherence scores for TEDx talks compared to Wikipedia texts, which calls for further investigation, also by expanding the current dataset of human-evaluated texts. Future plans involve extending the DisCoTEX task to a multilingual perspective, enabling the exploration of coherence modeling across different languages, using reproducible data collection processes in languages with available Wiki and TED resources.</p>
      </sec>
      <sec id="sec-4-3-3">
        <title>Acknowledgements</title>
        <p>The authors gratefully acknowledge the support of the PNRR MUR project PE0000013-FAIR.</p>
      </sec>
      <sec id="sec-4-3-4">
        <title>References</title>
        <p>[1] T. A. Van Dijk, W. Kintsch, Strategies of discourse comprehension, Academic Press, New York, 1983.
[2] B. J. Grosz, A. K. Joshi, S. Weinstein, Centering: A framework for modeling the local coherence of discourse, Computational Linguistics 21 (1995) 203–225. URL: https://aclanthology.org/J95-2003.
[3] R. Barzilay, M. Lapata, Modeling local coherence: An entity-based approach, Computational Linguistics 34 (2008) 1–34. doi:10.1162/coli.2008.34.1.1.
[4] M. Elsner, E. Charniak, Disentangling chat with local coherence models, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, 2011, pp. 1179–1189. URL: https://aclanthology.org/P11-1118.
[5] D. Tien Nguyen, S. Joty, A neural local coherence model, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, 2017, pp. 1320–1330. doi:10.18653/v1/P17-1121.
[6] J. Li, D. Jurafsky, Neural net models of open-domain discourse coherence, ArXiv abs/1606.01545 (2017).
[7] A. Lai, J. R. Tetreault, Discourse coherence in the wild: A dataset, evaluation and methods, CoRR abs/1805.04993 (2018). URL: http://arxiv.org/abs/1805.04993.
[8] M. Mesgar, M. Strube, A neural local coherence model for text quality assessment, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 4328–4339.
[9] B. Elvevåg, P. W. Foltz, D. R. Weinberger, T. E. Goldberg, Quantifying incoherence in speech: an automated methodology and novel application to schizophrenia, Schizophrenia Research 93 (2007) 304–316.
[10] D. Iter, J. Yoon, D. Jurafsky, Automatic detection of incoherent speech for diagnosing schizophrenia, in: Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, 2018, pp. 136–146.
[11] A. Shen, M. Mistica, B. Salehi, H. Li, T. Baldwin, J. Qi, Evaluating document coherence modeling, Transactions of the Association for Computational Linguistics 9 (2021) 621–640. doi:10.1162/tacl_a_00388.
[12] M. Chen, Z. Chu, K. Gimpel, Evaluation benchmarks and learning criteria for discourse-aware sentence representations, arXiv preprint arXiv:1909.00142 (2019).
[13] Y. Farag, J. Valvoda, H. Yannakoudakis, T. Briscoe, Analyzing neural discourse coherence models, in: Proceedings of the First Workshop on Computational Approaches to Discourse, Online, 2020, pp. 102–112. doi:10.18653/v1/2020.codi-1.11.
[14] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, EVALITA 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.
[15] D. Brunato, C. Chesi, F. Dell'Orletta, S. Montemagni, G. Venturi, R. Zamparelli, AcCompl-it @ EVALITA2020: Overview of the Acceptability and Complexity Evaluation Task for Italian, EVALITA Evaluation of NLP and Speech Tools for Italian, December 17th, 2020 (2020).
[16] L. Gregori, M. Montefinese, D. P. Radicioni, A. A. Ravelli, R. Varvara, CONcreTEXT @ EVALITA2020: The Concreteness in Context Task, EVALITA Evaluation of NLP and Speech Tools for Italian, December 17th, 2020 (2020).
[17] G. Caliendo, The popularisation of science in web-based genres, The language of popularisation: Theoretical and descriptive models 3 (2012) 101–132.
[18] D. Biber, S. Conrad, R. Reppen, Corpus linguistics: investigating language structure and use, Cambridge University Press, Cambridge, 1998.
[19] M. Galletti, P. Gravino, G. Prevedello, MPG at DisCoTex: Predicting text coherence by tree-based modelling of linguistic features, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.
[20] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 3982–3992. doi:10.18653/v1/D19-1410.
[21] E. Zanoli, M. Barbini, C. Chesi, IussNets at DisCoTex: A fine-tuned approach to coherence, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.
[22] S. Schweter, Italian BERT and ELECTRA models, 2020. doi:10.5281/zenodo.4263142.
[23] L. Parisi, S. Francia, P. Magnani, UmBERTo: an Italian language model trained with whole word masking, https://github.com/musixmatchresearch/umberto, 2020.
[24] M. Muffo, E. Bertino, BERTino: an Italian DistilBERT model, Computational Linguistics CLiC-it 2020 (2020) 317.
[25] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.
[26] G. Sarti, M. Nissim, IT5: Large-scale text-to-text pretraining for Italian language understanding and generation, ArXiv preprint 2203.03759 (2022). URL: https://arxiv.org/abs/2203.03759.
[27] A. Santilli, Camoscio: An Italian instruction-tuned LLaMA, https://github.com/teelinsan/camoscio, 2023.
[28] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. arXiv:2302.13971.</p>
      </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>