EVALITA 2023: 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, September 2023, Parma, Italy

at DisCoTeX: Predicting Text Coherence by Tree-based Modelling of Linguistic Features

Martina Galletti (1, 2, 3), martina.galletti@sony.com
Pietro Gravino (2, 3), pietro.gravino@sony.com
Giulio Prevedello (2, 3), giulio.prevedello@sony.com

(1) Department of Computer, Control and Management Engineering (DIAG) “Antonio Ruberti”, Sapienza University of Rome, via Ariosto 25, 00185, Rome, Italy
(2) Enrico Fermi's Research Center (CREF), via Panisperna 89A, 00184, Rome, Italy
(3) Sony Computer Science Laboratories Paris, 6, Rue Amyot, 75005, Paris, France

Automatic text coherence modelling plays a crucial role in natural language processing tasks, such as machine translation, summarisation, and question answering. Moreover, text coherence is fundamental to reading comprehension and readers' engagement, which are essential to a number of application domains. In this report, we describe our participation in the Assessing Discourse Coherence in Italian Texts task from EVALITA-23, whose goal is to address automatic coherence detection. We approached the task by extracting linguistic features used to train a machine learning classifier, which led to a minor improvement over the baseline. The feature importance analysis revealed the relevance of semantic features, providing indications for future feature engineering and modelling efforts.

Keywords: natural language processing, text coherence, Italian language, machine learning

1. Introduction

Coherence is an essential quality to facilitate comprehension [...] measures the quality of organization in the structure of a text and the extent to which a reader can follow the relationships between sentences and paragraphs. Several text coherence models exist in the literature which aim [...]ments. This distinction significantly affects some downstream tasks, such as document summarisation, auto[...] textual connectives [6], entity-grid based approaches [7, 8] which take inspiration from Centering Theory [9] to capture coherence [7, 10]. Recent approaches rely on neural architectures and use Convolutional Neural Networks (CNNs) over an entity-based representation of text [11, 12], Sequence to Sequence Models [3, 13] and Multi-Task Learning [14]. Nevertheless, language models still face challenges in capturing and predicting global coherence across longer texts, and targeted evaluation paradigms are still being implemented [15, 16, 17, 18, 19].

[...] “Assessing Discourse Coherence in Italian Texts” (DisCoTeX) [20] presented at EVALITA 2023 [21]. In this report, the feature extraction procedure and the modelling strategy are described in Section 2, while the results are illustrated in Section 3 and discussed in Section 4.

ORCID: 0000-0002-0937-8830 (P. Gravino); 0000-0002-9857-2351 (G. Prevedello)

2. Description of the system

The data provided were cleaned and pre-processed in a series of systematic steps. After removing irrelevant characters, the dataset provided by the task organisers was normalised to a standardised format, i.e. lemmas, to break the text into meaningful units. Then, words with fewer than three characters were removed, while stop words were kept. Short words often carry less semantic meaning than longer words, so they can increase the vocabulary size of a model without contributing significantly to its training; removing them can reduce noise in training. Stop words, especially conjunctions, were kept for their role in connecting sentences across a text and preserving syntactic and grammatical relationships during the training phase, even if they do not carry semantic meaning per se.

After lemmatization, we enriched the provided data by computing the length of words and sentences, since these lengths can provide insights into the complexity and readability of the text, which can, in turn, affect its intrinsic coherence. Longer words can indicate a more complex and domain-specific vocabulary, while longer sentences could indicate the presence of multiple sub-clauses, which could affect the overall coherence. Moreover, an abrupt variation in the length of words and sentences could indicate an imbalance in the structure of the text, thus endangering its linguistic coherence. For similar reasons, statistics on the use of the different tenses in sentences were also computed, since the use of appropriate tenses ensures temporal consistency and logical progression of information.

Afterwards, we extracted lexical features such as word frequency, for which we used the document-term matrix and Term Frequency - Inverse Document Frequency (TF-IDF), and sentence embeddings, i.e. Sentence-BERT (SBERT) [22].
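The pre-processing and lexical statistics described above can be illustrated with a minimal Python sketch. It assumes the text is already tokenised and lemmatised (the lemmatiser is not shown), and the function names `filter_tokens`, `length_stats` and `tfidf` are hypothetical. The TF-IDF weighting shown (relative term frequency times the logarithm of the inverse document frequency) is one common variant and may differ from the one actually used in the system.

```python
import math
from collections import Counter


def filter_tokens(tokens, min_len=3):
    """Drop tokens shorter than min_len characters; no stop-word list is applied."""
    return [t for t in tokens if len(t) >= min_len]


def length_stats(sentences):
    """Return (mean word length in characters, mean sentence length in tokens).

    `sentences` is a non-empty list of non-empty token lists.
    """
    words = [w for sentence in sentences for w in sentence]
    mean_word_len = sum(len(w) for w in words) / len(words)
    mean_sentence_len = len(words) / len(sentences)
    return mean_word_len, mean_sentence_len


def tfidf(docs):
    """Compute TF-IDF weights for each document (a list of tokens).

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights
```

With this variant, a term occurring in every document receives a weight of zero, so only terms that distinguish documents contribute to the later distance computations.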

To compress the high-dimensional vectors from the TF-IDF analysis and the sentence embedding, Uniform Manifold Approximation and Projection (UMAP) [23] was used to reduce their dimension to 30 components. UMAP was chosen over principal component analysis as it tends to preserve local distances better. Compared to t-distributed stochastic neighbour embedding, it is faster and better preserves the global data structure.

Finally, prompt and target were compared by means of the statistics mentioned above, resulting in the following list of features for each data point:
• tfidf_raw_wjac: weighted Jaccard distance between prompt’s and target’s TF-IDF vectors;
• tfidf_raw_cos: cosine distance between prompt’s and target’s TF-IDF vectors;
• tfidf_red_euc: euclidean distance between prompt’s and target’s TF-IDF vectors projected by UMAP;
• tfidf_red_cos: cosine distance between prompt’s and target’s TF-IDF vectors projected by UMAP;
• upper_case_word_count_t_over_tp: the number of upper case words in target divided by their sum in prompt and target;
• word_density_t_over_tp: word density in target divided by the sum in prompt and target;
• punctuation_count_t_over_tp: the number of punctuation marks in target divided by their sum in prompt and target;
• char_count_t_over_tp: the number of characters in target divided by their sum in prompt and target;
• word_count_t_over_tp: the number of words in target divided by their sum in prompt and target;
• tense_tar_per: the size of the set of tenses in both target and prompt divided by the one in the target only;
• tense_iou: the size of the set of tenses in both target and prompt divided by the one in either target or prompt;
• ent_tar_per: the size of the set of entities in both target and prompt divided by the one in the target only;
• ent_iou: the size of the set of entities in both target and prompt divided by the one in either target or prompt;
• sbert_umap1_pro: first component of the 2d UMAP projection of the average vector from the embedding of prompt’s sentences;
• sbert_umap2_pro: second component of the 2d UMAP projection of the average vector from the embedding of prompt’s sentences;
• sbert_umap1_tar: first component of the 2d UMAP projection of the vector from the embedding of target’s sentence;
• sbert_umap2_tar: second component of the 2d UMAP projection of the vector from the embedding of target’s sentence;
• sbert_red_euc: euclidean distance between 30d UMAP projection of the average vector from the embedding of prompt’s sentences and the vector from the embedding of target’s sentence;
• sbert_red_cos: cosine distance between 30d UMAP projection of the average vector from the embedding of prompt’s sentences and the vector from the embedding of target’s sentence;
• emb_cos_mean: average of the pairwise cosine distances between the prompt’s sentence embedding vectors and the target’s;
• emb_cos_max: maximum of the pairwise cosine distances between the prompt’s sentence embedding vectors and the target’s;
• emb_cos_last: the pairwise cosine distance between the vector embedding of the prompt’s last sentence and the target’s;
• emb_cos_min: minimum of the pairwise cosine distances between the prompt’s sentence embedding vectors and the target’s.
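Several of the features listed above follow a few simple patterns, which the following Python sketch illustrates under stated assumptions. The function names are hypothetical, the weighted Jaccard distance is taken to be one minus the ratio of summed element-wise minima to summed element-wise maxima (a common definition for non-negative vectors), and the values returned for empty or all-zero inputs are arbitrary conventions rather than choices documented for the system.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros (assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def weighted_jaccard_distance(u, v):
    """1 - sum(min(u_i, v_i)) / sum(max(u_i, v_i)) for non-negative vectors."""
    num = sum(min(a, b) for a, b in zip(u, v))
    den = sum(max(a, b) for a, b in zip(u, v))
    return 0.0 if den == 0 else 1.0 - num / den


def target_share(target_value, prompt_value):
    """t / (t + p): the pattern behind the features ending in t_over_tp."""
    total = target_value + prompt_value
    return 0.0 if total == 0 else target_value / total


def set_overlaps(prompt_items, target_items):
    """Return (share of target items also in prompt, intersection over union)."""
    p, t = set(prompt_items), set(target_items)
    inter, union = p & t, p | t
    tar_per = len(inter) / len(t) if t else 0.0
    iou = len(inter) / len(union) if union else 0.0
    return tar_per, iou


def embedding_distance_features(prompt_embeddings, target_embedding):
    """Aggregate cosine distances between each prompt sentence and the target sentence."""
    d = [cosine_distance(e, target_embedding) for e in prompt_embeddings]
    return {
        "emb_cos_mean": sum(d) / len(d),
        "emb_cos_max": max(d),
        "emb_cos_min": min(d),
        "emb_cos_last": d[-1],  # distance from the prompt's last sentence
    }
```

For instance, `set_overlaps` applied to the tense sets of prompt and target would yield values playing the role of tense_tar_per and tense_iou, while `embedding_distance_features` expects one embedding vector per prompt sentence, in document order, so that the last entry corresponds to the sentence immediately preceding the target.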

These features were then passed to a machine learning model that classifies whether the target sentence coherently follows the prompt text, thus tackling Subtask 1 of the challenge.

The classifier of choice was LGBMClassifier, a popular machine learning solution that combines computational efficiency with good predictive performance in various problems. The model was imported from LightGBM, a gradient boosting framework that uses tree-based learning algorithms [24], and binary cross-entropy was set as the objective function for training. The model’s hyperparameters were selected by stratified 10-fold cross-validation on shuffled data [25], exhaustively searching the space of hyperparameters (num_leaves ∈ {24, 25, 26}, max_depth ∈ {5, 6, −1}, learning_rate ∈ {0.009, 0.01, 0.011}, n_estimators ∈ {185, 190, 195}) for the combination with the best overall accuracy. This search space was defined empirically, starting with intervals centred around extreme parameters (num_leaves 50, max_depth 10, learning_rate 0.01, and n_estimators 1000). Then, for every new training instance, those intervals were re-centred at the previous best-fitting value if that value stood at an extremity of the interval; otherwise, the interval was narrowed. The random seed was set to 42 for reproducibility.
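The exhaustive search and the interval update rule can be sketched in plain Python as follows. This is only an illustration: the parameter names num_leaves, max_depth, learning_rate and n_estimators are the usual LightGBM names and are assumed here, the helper names are hypothetical, and the exact re-centring and narrowing steps (keeping the step when re-centring, halving it when narrowing) are assumptions, since the text does not specify them. Model training and cross-validation are left out.

```python
from itertools import product

# Final search space as reported in the text (parameter names assumed).
GRID = {
    "num_leaves": [24, 25, 26],
    "max_depth": [5, 6, -1],
    "learning_rate": [0.009, 0.01, 0.011],
    "n_estimators": [185, 190, 195],
}


def grid_combinations(grid):
    """Enumerate every combination of hyperparameter values as a dict."""
    names = list(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


def refine_interval(values, best, shrink=0.5):
    """Update a three-point interval [low, centre, high] given the best value.

    If the best value sits at an extremity, re-centre the interval on it with
    the same step; otherwise keep the centre and shrink the step (assumed rule).
    """
    low, centre, high = values
    step = centre - low
    if best in (low, high):
        return [best - step, best, best + step]
    new_step = step * shrink
    return [best - new_step, best, best + new_step]
```

The reported grid yields 3 × 3 × 3 × 3 = 81 combinations, each of which would be scored by cross-validated accuracy. As an example of the update rule, `refine_interval([185, 190, 195], 195)` shifts the interval to `[190, 195, 200]` because the best value lay on the upper boundary.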

The model’s performance was evaluated against the baseline provided by the challenge organizers. See [20] for more details.

3. Results

Training the system on the data available for Subtask 1, the model achieved an accuracy of 0.595 on the test set, improving upon the challenge baseline (0.525) by only 0.07 points. To provide some insights into this performance, the confusion matrix on the training data and the importance of the features are shown in Figure 1 and Figure 2, respectively. Finally, the relevant hyperparameters, resulting from the grid-search cross-validation, are reported in Table 1.

[Figure 2: feature importance by gain, one bar per feature; horizontal axis “Feature importance [gain]”, from 0 to 40000.]

4. Discussion

The features’ importance highlights that standard frequency statistics (such as comparisons between prompt and target on TF-IDF vectors and counts of upper case words, tenses, punctuation, words, and characters) are not very informative in predicting coherence for the system employed. Yet “ent_tar_per”, the proportion of target’s entities also present in the prompt’s text, was rather important. Semantic information seemed the most relevant, as supported by the many important features extracted by leveraging the sentence embedding model. Of note, “sbert_umap1_tar”, the projection of the target’s sentence embedding, was quite important although not derived from the comparison between prompt and target. Finally, the most important features resulted from the aggregation of the pairwise cosine distances between the sentence embedding of the target sentence and those of the prompt sentences. While the average and the maximum of these distances did not seem very important (“emb_cos_mean”, “emb_cos_max”), the minimum distance and the distance between the target’s and prompt’s last sentences (“emb_cos_min”, “emb_cos_last”) were the two most important features. These findings suggest that coherence is elicited by one or a few proximal sentences, while the rest might be of secondary importance.

Our results suggest that including syntactic and discourse-level features might lead to improved performance. Syntactic features, such as part-of-speech tags, dependency relations, or parse trees, can provide insights into sentence structure and overall grammatical coherence. Moreover, discourse-level features, such as entity co-reference, readability metrics, argumentative structure, discourse markers or topic progression-related features, could assess the flow of ideas in the documents provided. Future work will address the extraction of discourse structure and syntactic features to enable our model to assess a text’s logical connections, organization and grammatical structure.

The moderate performance improvement might also suggest limitations of standard machine learning models. This could be due to several reasons. First, these models have limited capacity to represent semantic relationships compared to deep learning models, as they lack the sequential modelling needed to represent a text’s underlying coherence. Moreover, they lack automatic contextual understanding, focusing more on provided features that might not capture the global context appropriately. Finally, they generally struggle with high-dimensional data. If the number of features is high, as in

Acknowledgments

This work has been supported by the Horizon Europe VALAWAI project (grant agreement number 101070930).

We also wish to thank the EVALITA-23 organizers for organizing the task and emphasizing the significance of text coherence measures for the Italian language. The creation and annotation of the corpus have been instrumental in advancing the field of Natural Language Processing for the Italian language and fostering community interest in coherence assessment. Your efforts will enable us to develop AI-driven methods for fostering comprehension assessment applied to both the infosphere and hybrid speech and language practices.

References

[1] Y. Jernite, S. R. Bowman, D. Sontag, Discourse-based objectives for fast unsupervised sentence representation learning, arXiv preprint arXiv:1705.00557 (2017).
[2] Y. Wu, B. Hu, Learning to extract coherent summary via deep reinforcement learning, in: Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
[3] J. Li, D. Jurafsky, Neural net models of open-domain discourse coherence, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 198–209. URL: https://aclanthology.org/D17-1019. doi:10.18653/v1/D17-1019.
[4] Z. Lin, H. T. Ng, M.-Y. Kan, Automatically evaluating text coherence using discourse relations, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 997–1006.
[5] M. Zhang, V. W. Feng, B. Qin, G. Hirst, T. Liu, J. Huang, Encoding world knowledge in the evaluation of local coherence, in: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 1087–1096.
[6] G. Albertin, A. Miaschi, D. Brunato, On the role of textual connectives in sentence comprehension: A new dataset for italian, in: CLiC-it, 2021.
[7] R. Barzilay, M. Lapata, Modeling local coherence: An entity-based approach, Computational Linguistics 34 (2008) 1–34.
[8] M. Elsner, E. Charniak, Extending the entity grid with entity-specific features, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 125–129.
[9] B. J. Grosz, A. K. Joshi, S. Weinstein, Centering: A framework for modelling the local coherence of discourse, IRCS Technical Reports Series (1995).
[10] M. Lapata, R. Barzilay, et al., Automatic evaluation of text coherence: Models and representations, in: Ijcai, volume 5, 2005, pp. 1085–1090.
[11] D. T. Nguyen, S. Joty, A neural local coherence model, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1320–1330.
[12] H. C. Moon, T. Mohiuddin, S. Joty, X. Chi, A unified neural coherence model, arXiv preprint arXiv:1909.00349 (2019).
[13] M. Mesgar, M. Strube, A neural local coherence model for text quality assessment, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 4328–4339. URL: https://aclanthology.org/D18-1464. doi:10.18653/v1/D18-1464.
[14] Y. Farag, H. Yannakoudakis, Multi-task learning for coherence modeling, arXiv preprint arXiv:1907.02427 (2019).
[15] A. Beyer, S. Loáiciga, D. Schlangen, Is incoherence surprising? targeted evaluation of coherence prediction from language models, arXiv preprint arXiv:2105.03495 (2021).
[16] Y. Farag, J. Valvoda, H. Yannakoudakis, T. Briscoe, Analyzing neural discourse coherence models, arXiv preprint arXiv:2011.06306 (2020).
[17] A. Lai, J. Tetreault, Discourse coherence in the wild: A dataset, evaluation and methods, arXiv preprint arXiv:1805.04993 (2018).
[18] L. Pishdad, F. Fancellu, R. Zhang, A. Fazly, How coherent are neural models of coherence?, in: Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 6126–6138.
[19] A. Shen, M. Mistica, B. Salehi, H. Li, T. Baldwin, J. Qi, Evaluating document coherence modeling, Transactions of the Association for Computational Linguistics 9 (2021) 621–640.
[20] D. Brunato, D. Colla, F. Dell’Orletta, I. Dini, D. P. Radicioni, A. A. Ravelli, Discotex at evalita 2023: Overview of the assessing discourse coherence in italian texts task, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.
[21] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, Evalita 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.
[22] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv preprint arXiv:1908.10084 (2019).
[23] L. McInnes, J. Healy, N. Saul, L. Grossberger, Umap: Uniform manifold approximation and projection, The Journal of Open Source Software 3 (2018) 861.
[24] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T. Liu, Lightgbm: A highly efficient gradient boosting decision tree, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.