Building Foundations for Inclusiveness through Expert-Annotated Data

Moreno La Quatra moreno.laquatra@unikore.it Kore University of Enna

Enna Italy

Salvatore Greco salvatore_greco@polito.it Politecnico di Torino

Turin Italy

Luca Cagliero luca.cagliero@polito.it Politecnico di Torino

Turin Italy

Michela Tonti michela.tonti@unibg.it Università degli studi di Bergamo

Bergamo Italy

Francesca Dragotto francesca.dragotto@gmail.com Università degli Studi di Roma Tor Vergata

Rome Italy

Rachele Raus rachele.raus@unibo.it Università di Bologna

Bologna Italy

Stefania Cavagnoli stefania.cavagnoli@uniroma2.it Università degli Studi di Roma Tor Vergata

Rome Italy

Tania Cerquitelli tania.cerquitelli@polito.it Politecnico di Torino

Turin Italy

ISSN 1613-0073

Keywords: inclusive language, natural language processing, text generation, deep learning

Natural Language Understanding and Generation models have a limited ability to capture the nuances of inclusive communication because they are trained on massive data, often including significant portions of non-inclusive content. Even when the models are specifically designed to address non-inclusive language detection or reformulation, they disregard, to a large extent, inclusiveness-related features that are likely correlated with the inclusive language nuances, such as the discourse type, level of inclusiveness, and intended context of use. To assess the importance of additional inclusiveness-related features, we collect a new corpus of Italian administrative documents manually annotated by linguistic experts. The experts not only highlight non-inclusive text snippets and propose possible reformulations, but also annotate multi-aspect labels related to different inclusive language nuances. We empirically show that a multi-task learning approach that leverages the multi-aspect annotations can improve the non-inclusive text reformulation performance, thereby confirming the potential of expert-annotated data in inclusive language processing.

Introduction

Non-inclusive expressions are widespread in human-written documents [1]. Training Natural Language Understanding and Generation models on massive data exposes them to bias issues related to language inclusiveness. Addressing this issue is particularly relevant because Artificial Intelligence (AI)-based solutions must be used responsibly to correctly model inclusive language practices and not unintentionally marginalize or disadvantage certain groups.

To mitigate the presence of bias in data, applications based on AI rely on human supervision for model training and post-processing evaluation. This is quite common in the areas of Natural Language Understanding and Generative AI, in which applications like Large Language Models (LLMs) provide end-users with conversational and language editing services [2].

The computational linguistics community has agreed on the need to leverage human expert annotations in experience-based learning for bias detection and mitigation [3,4,5,6]. However, the linguistics literature often underestimates the importance of linguistic annotators because of the widespread tendency to value the figures of pre- and post-editors [7,8]. Editing and annotation are substantially different: while language editing tools rewrite parts of the source text based on predefined expert-provided rules, Natural Language Understanding and Generation models can leverage annotations to capture the nuances of annotated text in a self-supervised manner. The use of textual annotations also relieves annotators of the task of explicitly formulating or adhering to ad hoc linguistic rules.

In the context of inclusive language understanding and generation, most of the previous work exploits rule-based approaches or round-trip translations to annotate texts for inclusivity issues [9,10,11,12]. These works often overlook the significance of human expert annotations, relying instead on hand-written rules or on artificially created datasets. Yet the role of linguistic annotators in providing specific understanding and annotations of language data is crucial for developing more inclusive AI models [13,14].

A limited body of work has been devoted to generating and exploiting multi-faceted expert human annotations to drive AI models for inclusive language, e.g., [15,16,17]. However, existing benchmarks of annotated text for inclusive language processing neglect potentially relevant aspects such as the level of inclusiveness, the intended context of use, and the text genre. These aspects have the potential to improve the inclusive language understanding and generation capabilities of AI models.

This paper proposes an expert-annotated dataset covering these new aspects and investigates their usefulness in enhancing the performance of the task of non-inclusive text reformulation in the absence of rule-based editing models.

To this end, we enrich a corpus of Italian administrative documents with multi-aspect annotations, providing more insights into the inclusive language nuances. The purpose is to enable the study of new features describing inclusiveness aspects neglected by existing approaches, such as the level of inclusiveness, register, and genre. By enriching the language descriptions with new inclusiveness-related features, we provide the research community with new resources to enhance the understanding and writing capabilities of AI-based solutions.

We also collect preliminary results on the use of multi-aspect annotations in a multi-task learning approach to enhance non-inclusive language reformulations. The results confirm the potential of the inclusiveness-related expert annotations.

The annotation process

The term annotation is often used to indicate the process by which textual data are subjected to a tightly interrelated two-phase activity [6]: a) Identification, selection, and localisation of specific documents, and b) Interpretation and labeling of those documents. The first phase entails identifying and detailing the text segments that exhibit the linguistic phenomenon under investigation. Subsequently, in the interpretation phase, the selected occurrences are manually labeled. These annotations may take various forms, ranging from a selection among pre-established alternatives to free-text comments or possible reformulations.

Unlike human annotators, AI models often lack cognitive abilities such as common sense reasoning and generalization capabilities due to the relatively limited numbers of linguistic examples used for model training compared to the impressive variety of natural language forms.

Human annotators need sufficient expertise to interpret nuanced phenomena and assign appropriate labels. Their annotations form the basis of a supervised learning process. The trained models can progressively learn from annotated data to reproduce the experts' judgments automatically, at a scale not possible through manual work alone.

Annotation of Italian administrative documents.

We have designed and used a novel benchmark dataset for inclusive language writing in Italian. This dataset comprises administrative communications sourced from the Italian public administration, spanning both national and regional levels. We annotate the corpus at the sentence level. To this end, we set up a heterogeneous team of 13 linguistic experts with diverse experiences and expertise in inclusive language. The team is predominantly female, and all its members are native Italian speakers. All the annotators are educated: 57% have at least 10 years of experience in linguistics, and 50% have at least 3 years of experience in inclusive language. In addition, the annotators received, on average, about 30 hours of training specific to inclusive language annotations.

Each human annotator independently assigns inclusiveness-related metadata to the document sentences. Each sentence can be enriched with multiple annotations. The annotations consist of (a) the reformulation of any non-inclusive piece of text, i.e., an alternative inclusive form; (b) the level of inclusiveness of the input sentence, indicating whether a sentence is non-inclusive, inclusive, or not pertinent; (c) the register or intended context of use, i.e., Standard, Specialized, or Informative/Educational; (d) the discourse type or genre, i.e., Legal, Administrative, Technical, or Informative/Educational.
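The four annotation types listed above can be pictured as a single record per sentence and annotator. The following is a minimal sketch under stated assumptions, not the schema actually used for the corpus: the class and field names (SentenceAnnotation, annotator_id, tasks) are hypothetical, and the example sentence is invented for illustration rather than taken from the dataset. Every aspect is optional, since a sentence may carry only some of the annotations.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class InclusivenessLevel(Enum):
    NON_INCLUSIVE = "non-inclusive"
    INCLUSIVE = "inclusive"
    NOT_PERTINENT = "not pertinent"


class Register(Enum):
    STANDARD = "Standard"
    SPECIALIZED = "Specialized"
    INFORMATIVE_EDUCATIONAL = "Informative/Educational"


class Genre(Enum):
    LEGAL = "Legal"
    ADMINISTRATIVE = "Administrative"
    TECHNICAL = "Technical"
    INFORMATIVE_EDUCATIONAL = "Informative/Educational"


@dataclass
class SentenceAnnotation:
    """One annotator's labels for one sentence (hypothetical layout)."""
    sentence: str
    annotator_id: str
    reformulation: Optional[str] = None                  # (a) inclusive rewrite
    inclusiveness: Optional[InclusivenessLevel] = None   # (b)
    register: Optional[Register] = None                  # (c)
    genre: Optional[Genre] = None                        # (d)

    def tasks(self) -> List[str]:
        """Return the task IDs for which this record provides a label."""
        available = []
        if self.reformulation is not None:
            available.append("NILR")
        if self.inclusiveness is not None:
            available.append("ILC")
        if self.register is not None:
            available.append("RC")
        if self.genre is not None:
            available.append("GC")
        return available


# Invented example, not a sentence from the corpus.
example = SentenceAnnotation(
    sentence="Il Presidente firma il decreto.",
    annotator_id="annotator-01",
    reformulation="La Presidenza firma il decreto.",
    inclusiveness=InclusivenessLevel.NON_INCLUSIVE,
    register=Register.SPECIALIZED,
    genre=Genre.ADMINISTRATIVE,
)
print(example.tasks())
```

Keeping the closed label sets as enumerations makes it harder to introduce a misspelled label, while the optional fields allow the per-task training sets to differ in size.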

Additional contextual aspects could be included in future annotations to enhance models' understanding of inclusive language usage further. By jointly providing those annotations, the experts aimed to capture inclusive language's nuanced, multi-faceted nature.

By learning language inclusiveness patterns from a diversified, context-dependent set of expert annotations, AI models gain exposure to subtle interpretive differences. The consistency across annotations is ensured through detailed guidelines and instructions provided to experts. Before full annotation, a collaborative analysis of a sample set identifies any divergent interpretations to refine guidance.

Example of annotations. Table 2 shows an example of an Italian annotated sentence (as well as the corresponding English translation for non-Italian readers). Linguistic experts assign different annotations to each sentence. In this example, they have assigned three labels to the sentence. Regarding inclusiveness, the sentence has been categorized as non-inclusive because it contains "Il Presidente" (i.e., Chair/President) and "Rettore" (i.e., Rector), which are masculine forms of professional roles, as well as "suo decreto", which refers to a decree that comes from a male person. The discourse sequence is of the administrative type, as the content refers to an administrative topic, and the language used is specialized, as the content describes specific and technical aspects.

Statistics on annotated data.

Case study: Leveraging Aspects for Italian Inclusive Language Reformulation

We conduct an empirical analysis to examine the impact of utilizing expert annotations in inclusive language generation. Specifically, we investigate the advantages of simultaneously addressing two key objectives: reformulating non-inclusive language and predicting various aspects of inclusiveness.

Tasks. Given a non-inclusive piece of text 𝑇 , the Non-Inclusive Language Reformulation (NILR) task aims at generating an equivalent inclusive natural language form. The NILR task is a sequence-to-sequence problem, where the input is a non-inclusive sentence and the output is the corresponding inclusive sentence.

Given 𝑇 and an aspect 𝐴, the goal is to predict the value of 𝐴 for 𝑇. 𝐴 can be the level of inclusiveness, the register or intended context of use, or the discourse type or genre. According to the aspect under analysis, the corresponding sub-tasks are denoted by Inclusiveness Level Classification (ILC), Register Classification (RC), and Genre Classification (GC). The ILC, RC, and GC tasks are treated as separate classification problems, where the input is a sentence and the output is the corresponding aspect value.

Table 3: Performance comparison between Single- and Multi-task Learning approaches in inclusive language generation, evaluated based on ROUGE scores (R-1, R-2, R-L) and human evaluation.

Single- vs. Multi-Task Learning

To compare the performance of models trained using different learning approaches, we conducted experiments in both single-task and multi-task learning settings.

In Single-Task Learning, we exclusively focus on the task of Non-Inclusive Language Reformulation (NILR), disregarding all aspect-related annotations. We leverage an encoder-decoder architecture, specifically BART-IT [18], which is a BART architecture [19] pre-trained on a clean Italian corpus [20]. The model is fine-tuned on the NILR task with the twofold objective of modifying the input sentence to make it inclusive while maintaining the original meaning.

Conversely, in Multi-Task Learning, we integrate the NILR task with Aspect Classification tasks during training (i.e., ILC, RC, and GC). For the additional tasks, we specifically leverage the encoder component of the model, which extracts representations of the input text. The encoder component is additionally trained with a classification objective. Each task is associated with a separate classification head, trained to predict the corresponding aspect value for the input sentence. By interleaving these tasks during training, the model learns to simultaneously address NILR and create encoder representations that capture various aspects related to inclusiveness.

Evaluation Metrics. We evaluate the quality of the text reformulation using a standard train-validation-test split on our expert-annotated data. To compare the automatically generated and expected reformulations, we use the established ROUGE F1-scores [21]. They measure the unit overlap, in terms of the number of n-grams in common, between the two pieces of text. The larger the score, the higher the syntactic similarity. R-1, R-2, and R-L count the unit overlap in terms of unigrams, bigrams, and longest common subsequences, respectively.
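To make the ROUGE F1-scores concrete, the following minimal sketch computes R-1, R-2, and R-L for a pair of sentences. It is a simplified illustration, not the official ROUGE package [21] nor the evaluation script used for the experiments: it assumes lowercased whitespace tokenization and applies no stemming, and the two Italian sentences are invented examples.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of the n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, candidate_total, reference_total):
    """Harmonic mean of precision and recall computed from overlap counts."""
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1: clipped n-gram overlap between candidate and reference."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


# Invented sentences: a model output and the expected inclusive form.
generated = "il presidente firma il decreto"
expected = "la presidenza firma il decreto"
print(rouge_n(generated, expected, 1))
print(rouge_n(generated, expected, 2))
print(rouge_l(generated, expected))
```

Because these scores only count shared token sequences, a reformulation that changes just one or two words scores high even when the changed words are exactly the non-inclusive ones, which is one reason to pair them with a human evaluation.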

To complement the quantitative evaluation, we also perform a qualitative evaluation of the achieved results. We involved six human evaluators who were asked to label each model-generated sentence as: correct if it accurately maintained the original meaning while using inclusive language appropriately for the context; partially correct if some aspects were reformulated correctly, but others were missed or inaccurate; or not correct if the rewriting fundamentally failed to capture the original meaning or usage intention. This multi-level feedback aims at capturing the models' ability to perform the rewriting task sensitively across different scenarios beyond just string-matching metrics.

We score each annotation of a reformulation as follows: 1 for correct, 0.5 for partially correct, and 0 for not correct. The final score for each reformulation is computed as the average over all its annotations (𝑚 = 6). Finally, we average the scores over all the reformulations (𝑛 = 30) to obtain a single score for each model.

Results' overview. Columns 2, 3, and 4 in Table 3 show the ROUGE scores for both models. The multi-task learning approach achieves the best performance on all the quantitative metrics. Regarding the human evaluation, we obtained 6 annotations for each of the 30 reformulations produced by each model, i.e., 180 annotations per model. For the model trained with the single-task configuration, 93 were correct, 55 were partially correct, and 32 were not correct. For the multi-task model, 101 were correct, 49 were partially correct, and 30 were not correct. Column 5 reports the average human evaluation scores for both models. The human scores are consistent with the quantitative ones, showing that the model trained under multi-task settings benefits from the additional labels. Based on these preliminary results, we can conclude that the nuanced and multidimensional annotations of inclusive language have the potential to develop a more comprehensive approach to modeling inclusive language.
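The human evaluation scores can be recomputed from the reported label counts. The sketch below is an illustration with hypothetical function names (reformulation_score, model_score, score_from_counts); it relies on the fact that every reformulation receives the same number of annotations, so the average of the per-reformulation averages equals the average over all annotations.

```python
# Score assigned to each evaluator label, as described in the text.
LABEL_SCORES = {"correct": 1.0, "partially correct": 0.5, "not correct": 0.0}


def reformulation_score(labels):
    """Average score of one reformulation over its evaluator labels."""
    return sum(LABEL_SCORES[label] for label in labels) / len(labels)


def model_score(labels_per_reformulation):
    """Average of the per-reformulation scores for one model."""
    scores = [reformulation_score(labels) for labels in labels_per_reformulation]
    return sum(scores) / len(scores)


def score_from_counts(correct, partially_correct, not_correct):
    """Model score from aggregate label counts.

    Equivalent to model_score when every reformulation has the same
    number of labels (here, 6 labels for each of 30 reformulations).
    """
    total = correct + partially_correct + not_correct
    return (1.0 * correct + 0.5 * partially_correct) / total


single_task = score_from_counts(93, 55, 32)
multi_task = score_from_counts(101, 49, 30)
print(round(single_task, 2), round(multi_task, 2))
```

Rounded to two decimals, the two values match the Human Eval column of Table 3 (0.67 and 0.70).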

Conclusions

This paper discussed and experimentally demonstrated that the role and contribution of human annotators are of paramount importance in improving the quality of NLP results and the writing capability of generative approaches in inclusive communication. Starting from a new Italian administrative corpus, we enriched it with a variety of annotations with the help of a team of language experts. This included (i) reformulating gendered language and acronyms, (ii) rewriting to enhance readability for the visually impaired, and (iii) defining the intended context of use (register) and text genre. The preliminary experimental results on the annotated corpus are promising and highlight the potential of the newly proposed annotations to develop a more comprehensive and richer approach that improves the ability of the generative algorithm to propose comprehensive and integrative reformulations.

Limitations. i) The annotation is language-specific: it is limited to the Italian language, which constrains its utility in multilingual scenarios; and ii) it is specific to formal communication: tailored to tackle the challenge of inclusive language in administrative and academic settings, the models are trained exclusively on administrative documents and may not be suitable for other contexts such as legal and web communications.

Future work.

As part of the E-MIMIC (Empowering Multilingual Inclusive Communication) project, we are currently working on a multilingual annotation process to overcome these issues and foster inclusive communication across different domains and languages. A team of experts is annotating a large corpus of documents according to linguistic criteria to label linguistic resources in a multilingual setting.

Finally, we want to exploit text-based explainability techniques [22,23] to perform further human validation of the models produced.

Ethical Considerations. All the gathered documents are public and therefore freely accessible on the internet. All references to proper names of people and institutions have been anonymized and replaced with random names for privacy reasons.

Table 1 reports the number of annotated sentences for each aspect, separately for the training, validation, and test sets.

Table 1: Statistics on data. NILR=Non-Inclusive Language Reformulation, ILC=Inclusiveness Level Classification, RC=Register Classification, GC=Genre Classification.

Task ID    Train    Validation    Test
NILR       6491     956           579
ILC        9207     1421          866
RC         2167     338           247
GC         2166     338           248

Table 2: Example of sentence annotations illustrating non-inclusive language reformulation in Italian (IT) and English (EN), along with corresponding inclusiveness classification, discursive sequence, and clear language class.

Table 3

Setting        R-1      R-2      R-L      Human Eval
Single-Task    74.95    64.09    74.79    0.67
Multi-Task     75.58    64.37    75.36    0.70

Acknowledgments

This study was carried out within the project "E-MIMIC: Empowering Multilingual Inclusive Communication", funded by the Ministero dell'Università e della Ricerca with the PRIN 2022 (D.D. 104 - 02/02/2022) program.

References

[1] S. J. Ashwell, P. K. Baskin, S. L. Christiansen, S. A. DiBari, A. Flanagin, T. Frey, R. Jemison, M. Ricci, Three recommended inclusive language guidelines for scholarly publishing: Words matter, Learn. Publ. 36 (2023). doi:10.1002/LEAP.1527.

[2] A. Balayn, J. Yang, Z. Szlávik, A. Bozzon, Automatic identification of harmful, aggressive, abusive, and offensive language on the web: A survey of technical biases informed by psychology literature, ACM Trans. Soc. Comput. 4 (2021) 56. doi:10.1145/3479158.

[3] R. Artstein, M. Poesio, Bias decreases in proportion to the number of annotators, in: Proceedings of FG-MoL 2005: The 10th conference on Formal Grammar and The 9th Meeting on, 2009, p. 139.

[4] R. Artstein, M. Poesio, Inter-coder agreement for computational linguistics, Computational Linguistics 34 (2008).

[5] J. Carletta, Assessing agreement on classification tasks: The kappa statistic, Computational Linguistics 22 (1996).

[6] P. S. Bayerl, K. I. Paul, What determines inter-coder agreement in manual annotations? A meta-analytic investigation, Computational Linguistics 37 (2011). doi:10.1162/COLI_a_00074.

[7] J. Monti, Dalla zairja alla traduzione automatica: riflessioni sulla traduzione nell'era digitale, Loffredo, 2019.

[8] P. Sánchez-Gijón, D. Kenny, Selecting and preparing texts for machine translation: Pre-editing and writing for a global audience, in: Machine translation for everyone: Empowering users in the age of artificial intelligence, 18 (2022) 81.

[9] B. Alhafni, N. Habash, H. Bouamor, User-centric gender rewriting, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022. doi:10.18653/v1/2022.naacl-main.46.

[10] C. Amrhein, F. Schottmann, R. Sennrich, S. Läubli, Exploiting biased models to de-bias text: A gender-fair rewriting model, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023. doi:10.18653/v1/2023.acl-long.246.

[11] T. Sun, K. Webster, A. Shah, W. Y. Wang, M. Johnson, They, them, theirs: Rewriting with gender-neutral english, CoRR abs/2102.06788 (2021).

[12] E. Vanmassenhove, C. Emmery, D. Shterionov, NeuTral Rewriter: A rule-based and neural approach to automatic rewriting into gender neutral alternatives, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021. doi:10.18653/v1/2021.emnlp-main.704.

[13] A. Piergentili, D. Fucci, B. Savoldi, L. Bentivogli, M. Negri, Gender neutralization for an inclusive machine translation: from theoretical foundations to open challenges, in: Proceedings of the First Workshop on Gender-Inclusive Translation Technologies, European Association for Machine Translation, Tampere, Finland, 2023.

[14] M. Rosola, S. Frenda, A. T. Cignarella, M. Pellegrini, A. Marra, M. Floris, Beyond obscuration and visibility: Thoughts on the different strategies of gender-fair language in italian, in: Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), Venice, Italy, November 30-December 2, 2023, volume 3596, CEUR-WS, 2023.

[15] G. Attanasio, S. Greco, M. La Quatra, L. Cagliero, M. Tonti, T. Cerquitelli, R. Raus, E-mimic: Empowering multilingual inclusive communication, in: 2021 IEEE International Conference on Big Data (Big Data), IEEE, 2021.

[16] M. La Quatra, S. Greco, L. Cagliero, T. Cerquitelli, Inclusively: An ai-based assistant for inclusive writing, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2023.

[17] R. Raus, M. Tonti, T. Cerquitelli, L. Cagliero, G. Attanasio, M. La Quatra, S. Greco, L'analyse du discours et l'intelligence artificielle pour réaliser une écriture inclusive : le projet emimic, SHS Web Conf. 138 (2022) 1007. doi:10.1051/shsconf/202213801007.

[18] M. La Quatra, L. Cagliero, Bart-it: An efficient sequence-to-sequence model for italian text summarization, Future Internet 15 (2022) 15. doi:10.3390/fi15010015.

[19] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2020. doi:10.18653/v1/2020.acl-main.703.

[20] G. Sarti, M. Nissim, It5: Large-scale text-to-text pretraining for italian language understanding and generation, arXiv preprint arXiv:2203.03759 (2022).

[21] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004.

[22] G. Sarti, N. Feldhus, L. Sickert, O. van der Wal, M. Nissim, A. Bisazza, Inseq: An interpretability toolkit for sequence generation models, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Association for Computational Linguistics, Toronto, Canada, 2023. doi:10.18653/v1/2023.acl-demo.40.

[23] F. Ventura, S. Greco, D. Apiletti, T. Cerquitelli, Trusting deep learning natural-language models via local and global explanations, Knowl. Inf. Syst. 64 (2022). doi:10.1007/s10115-022-01690-9.