<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Modal Sense Classifier for the French Modal Verb Pouvoir</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anna Colli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diego Rossini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Delphine Battistelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Modyco laboratory, Paris Nanterre University</institution>
          ,
          <addr-line>200 Av. de la République, 92000 Nanterre</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Paris Nanterre University</institution>
          ,
          <addr-line>200 Av. de la République, 92000 Nanterre</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper we address the problem of modal sense classification for the French modal verb pouvoir in a transcribed spoken corpus. To the best of our knowledge, no studies have focused on this task in French. We fine-tuned various BERT-based models for French in order to determine which one performed best. It was found that the flaubert-base-cased model was the most effective (F1-score of 0.94) and that the most frequent categories in our corpus were material possibility and ability, which are both part of the more global alethic category.</p>
      </abstract>
      <kwd-group>
<kwd>pouvoir</kwd>
        <kwd>modal verbs</kwd>
        <kwd>Modal Sense Classification</kwd>
        <kwd>BERT</kwd>
        <kwd>modality</kwd>
        <kwd>French</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In this paper, we present our research into the automatic disambiguation of the French modal verb pouvoir (in English, this verb can be translated by can, could, may or might) in a corpus of semi-structured interviews (the code and the annotated corpus are available on GitHub: https://github.com/DiegoRossini/Modal-verbs-modality-detector; the fine-tuned model is available at https://huggingface.co/DiegoRossini/flaubert-pouvoir-modality-detector). This problem statement is part of a broader quantitative and qualitative analysis currently underway on modal markers, whose goal is to better understand which kinds of modal categories are prevalent in this kind of corpus. As an NLP task, the problem of the automatic disambiguation of modal markers relies on what is generally called “modal sense classification” (MSC). As far as we know, no studies have focused on disambiguating modal verbs using a machine learning approach in French. Our aim is to fill this gap by finding the best fine-tuned BERT model to classify the semantic values of the French modal verb pouvoir in a transcribed spoken corpus. The article is organized as follows. In section 2 we review related work on the task of modal sense classification. Section 3 describes our corpus and our linguistic model. Section 4 presents the annotation of the corpus with an annotation scheme. Section 5 presents our experiments in fine-tuning different BERT models in order to choose the most effective one. Finally, in section 6 we discuss our results and in section 7 we close our contribution with conclusions and suggestions for future research.</p>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy. © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>The first study to focus exclusively on modal sense classification was [1], who proposed logistic regression models for each modal verb in English, based on an ensemble of hand-crafted syntactic and lexical features. It was also the first study to present an annotation scheme and an annotated news domain corpus. Further studies pointed out the problem of the biased distribution and sparsity of the data used in [1]. For example, two of these studies, [2] and [3], suggested creating a larger and balanced dataset using a paraphrase projection approach from German data (an English-German parallel corpus of film subtitles and proceedings from the EU Parliament). More specifically, [2] updated the original feature set with semantic features. [3] also updated the original features of [1] with lexical and discourse features to improve the performance of the classifiers; in addition, they explored the influence of genre on the classification of modal verbs. Lastly, [4] proposed the most accurate and flexible alternative to classifiers based on manually engineered features. Their model is based on a CNN architecture and is able to automatically extract features that are relevant for classification (word embeddings). By adapting the model to German, they demonstrated the model’s ability to generalize across different languages. [5] introduced another model architecture in which a simple classifier is fed with a combination of three sets of hand-crafted features and a concatenation of pre-trained embeddings of context words. This representation of the modal context was obtained by testing various weighting schemes. More recent studies have attempted to solve the problem as a classical modal sense classification task by probing the BERT architecture [6]. BERT-based models do not need a hand-crafted feature set and they are claimed to be better at capturing contextual information than previous models. [7] showed that BERT does not have a unique representation for each modal sense but, given the same semantic value, encodes it differently for each modal verb. For this reason, individual classifiers for each verb perform better than a classifier for each modal sense. Finally, [8] used BERT’s last hidden layer representations of the English modal verbs and their context to feed a k-NN and a logistic regression model. In addition, they tried to train a single common model for all the modal verbs, but they showed that for some of them, including can and could, this does not improve the results. [8] used the [1] and [2] datasets and also introduced a new and richer dataset from COCA (https://www.english-corpora.org/coca/), characterized by 5 genres including the spoken genre. In general, BERT-based models outperform the frequency baseline and previous models for almost all modal verbs. Regarding French, as far as we know, no research has yet focused on the disambiguation of modal verbs using a machine learning approach. The only NLP approach is [9], which studied the notion of “possible” and adopted a symbolic approach with a set of rules to semantically annotate epistemic possibility. The present paper aims to fill this void by using a BERT architecture to solve the MSC task in a transcribed spoken French corpus. We present here the work carried out for the disambiguation of the modal verb pouvoir.</p>
      <sec id="sec-2-1">
        <title>3.2. Linguistic model for analysing the semantic values of pouvoir</title>
        <p>In French, several studies have focused on elucidating the various contextual meanings of the modal verb pouvoir, e.g. ([13]; [14]; [12]). In order to build our annotation scheme (see section 4.1), we rely on the analysis presented in [12]. This is the model that was used in the ModalE tool for extracting modal markers [10]. As mentioned in section 3.1, this tool assigns 3 possible global modal categories to pouvoir: alethic, epistemic and deontic. A deeper analysis of pouvoir, based on [12], led us to consider that this modal verb can have 6 possible refined modal categories (see table 6): 4 belong to the alethic category (descriptive judgements on a reality independent of the subject), 1 is part of the epistemic category (descriptive judgements referring to a subjective evaluation of the reality by the subject) and 1 belongs to the deontic one (prescriptive judgements based on institutions or systems of conventions). In [12], the values of “possibilité matérielle” (material possibility) and “capacité” (ability) are first presented [12, p. 442] as two distinct values, and later [12, p. 448] as part of a single one. Since this ambiguity is not resolved in Gosselin’s typology, we decided to treat them as two distinct values.</p>
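<p>The resulting inventory of refined categories and their global categories can be written as a simple mapping. This is an illustrative sketch, not code from the paper; the alignment of refined to global categories follows the annotation scheme of section 4.1 and the counts given above (4 alethic, 1 epistemic, 1 deontic), with the “undetermined” catch-all added during annotation:</p>

```python
# Refined modal categories of pouvoir and their global modal categories,
# as described in section 3.2 (illustrative mapping, not the authors' code).
GLOBAL_CATEGORY = {
    "sporadicity": "alethic",
    "material possibility": "alethic",
    "ability": "alethic",
    "logical possibility": "alethic",
    "eventuality": "epistemic",
    "permission": "deontic",
    "undetermined": "undetermined",  # annotation-time catch-all
}

# The 4 alethic refined values mentioned in the text
alethic = [c for c, g in GLOBAL_CATEGORY.items() if g == "alethic"]
print(len(alethic))  # → 4
```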
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Corpus and linguistic model</title>
      <sec id="sec-3-1">
        <p>This section presents our corpus (3.1) and the linguistic model (3.2) on which the annotation scheme is based.</p>
        <sec id="sec-3-1-1">
          <title>3.1. The ES_CF corpus</title>
          <p>Our corpus – named here corpus ES_CF – is composed of 221 semi-structured interviews extracted from two different corpora (among the different types of interviews and recordings present in these two corpora, we extracted only the semi-structured interviews between an interviewer and an interviewee). In the first corpus, named Eslo (https://www.ortolang.fr/market/corpora/eslo; 700 recordings in total), we selected 207 interviews featuring questions to the citizens of Orléans about their habits and feelings regarding their city. In the second one, named CFPP (https://www.ortolang.fr/market/corpora/cfpp2000; 60 recordings in total), we selected 14 interviews containing similar questions but focusing on the city of Paris. An automatic tool, named ModalE, described in ([10]; [11]), was employed to count the different modal categories present in these two corpora. The tool is built on the typology proposed by [12]. Each French modal marker is associated with one or more modal categories depending on its more or less polysemous nature. The results indicate that the verb pouvoir is among the four most frequent modal markers in the ES_CF corpus, which contains globally 150,000 modal markers (the others being “bien” (well), 7.3% of the total modal markers; “dire” (to say), 6.9%; and “savoir” (to know), 5.6%; “pouvoir” accounts for 4.94%). The marker pouvoir is a “highly polysemous” marker as it can potentially be part of three categories: alethic, epistemic and deontic (see section 3.2 for their examination in detail). In order to determine the semantic value of each instance of such polysemous modal markers, we propose an NLP approach for disambiguating the modal verb pouvoir in its context. Our approach is based on the linguistic model of [12].</p>
          <p>In order to follow a supervised learning procedure, it is necessary to have a manually annotated corpus. Sections 4.1 and 4.2 describe the process of manual annotation and the constitution of 4 different versions of our annotated corpus that we used for the experiments detailed in section 5.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>4.1. Annotation procedure</title>
          <p>Table 2 presents the elements of our annotation scheme, based on [12]’s typology summarized in table 6 (for a fuller version with examples and definitions, see appendix A). Table 2 shows the 7 possible modal categories of pouvoir: the alethic global category groups sporadicity, material possibility, ability and logical possibility; the epistemic category corresponds to eventuality; the deontic category corresponds to permission; and an “undetermined” category completes the scheme. The logical possibility category is included in the annotation scheme even though we did not find any examples of it in our corpus. The “undetermined” category covers the occurrences of pouvoir for which an annotator hesitated between two or more values, as well as the ones that we were unable to annotate due to a lack of context. We annotated 24 interviews from the ES_CF corpus (17 from the Eslo corpus and 7 from the CFPP corpus), with an average length of 15,000 tokens. The annotation was carried out by three annotators (the first author and two linguistics masters students) using Glozz [15]. We then calculated two inter-annotator agreements using Fleiss’ Kappa. The first one, called “strict”, includes the 6 values (excluding logical possibility). For the second one, denominated “broad”, we decided to merge “ability” and “material possibility” into a single category called “material possibility and ability”, because of the ambiguity that persists in Gosselin [12]’s typology (see section 3.2), confirmed also by the frequent disagreement between annotators on these two categories. We obtained 0.6 for the strict inter-annotator agreement and 0.66 for the broad one. Since the broad agreement was better, we adopted this version of the annotated corpus for training. The model was trained on all the categories except logical possibility and the “undetermined” category. The total number of occurrences of pouvoir manually annotated in the corpus is 879 (sporadicity: 71 occurrences; material possibility or ability: 448; eventuality: 131; permission: 229). The annotated corpus is available on GitHub: https://github.com/DiegoRossini/Modal-verbs-modality-detector.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Corpus preparation</title>
        <p>In order to effectively train and evaluate our classifier for detecting the semantic value of the French verb pouvoir, we prepared 4 distinct datasets, each crafted to address specific challenges and enhance performance (see examples in appendix C).</p>
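<p>The inter-annotator agreement computation described in section 4.1 uses Fleiss’ Kappa, which generalizes chance-corrected agreement to three or more annotators. A minimal pure-Python sketch (toy labels, not our annotation data):</p>

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of category labels
    (one label per annotator)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    # counts[i][j]: how many raters assigned category j to item i
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # P_i: observed agreement on item i
    p_i = [(sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # p_j: overall proportion of assignments to category j
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)  # expected agreement by chance
    return (p_bar - p_e) / (1 - p_e)

# Three annotators labelling four occurrences of "pouvoir" (toy data)
toy = [
    ["permission", "permission", "permission"],
    ["ability", "material possibility", "ability"],
    ["eventuality", "eventuality", "eventuality"],
    ["sporadicity", "sporadicity", "eventuality"],
]
print(round(fleiss_kappa(toy), 3))  # → 0.564
```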
        <p>• Corpus Base: this dataset contains 776 sentences
with at least one occurrence of pouvoir. Serving
as our foundational dataset, it sufers from an
imbalance in the distribution of modality categories.</p>
        <p>This imbalance could bias the classifier toward more common categories, making it essential to address this issue in subsequent datasets.</p>
        <p>• Corpus Base Augmented: to rectify the imbalance observed in the Corpus Base, we created this augmented dataset containing 1716 sentences. We employed data augmentation using the cc.fr.300.bin model and the gensim library for lexical substitution. This process balanced the distribution of modality categories, resulting in a more evenly distributed training set for our classifier.</p>
        <p>• Corpus Context: considering the significant influence of the surrounding context on the meaning of the modal verb pouvoir, we constructed a third dataset (776 sentences with context). This dataset includes sentences with pouvoir along with one speaker’s phrase before and after, offering a broader contextual framework to help the classifier better understand the modal sense of pouvoir and make more accurate predictions.</p>
        <p>• Corpus Context Augmented: this fourth and final dataset combines the benefits of both data augmentation and expanded contextual framing (1716 sentences with context).</p>
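<p>The lexical-substitution step behind the augmented datasets can be illustrated with a small sketch. The actual pipeline used Spacy for verb detection and the FastText cc.fr.300.bin vectors (via gensim) for nearest-neighbour lookup; here both are injected as stand-in functions so the logic is self-contained, and pouvoir forms, which carry the label, are never replaced:</p>

```python
# Toy sketch of the lexical-substitution augmentation (not the authors'
# exact code). `is_verb` stands in for Spacy POS tagging and `most_similar`
# for a FastText nearest-neighbour query; the pouvoir form list is
# illustrative, not exhaustive.
POUVOIR_FORMS = {"pouvoir", "peux", "peut", "pouvez", "peuvent",
                 "pouvait", "pourrait", "pu"}

def augment(tokens, is_verb, most_similar):
    """Return a new token list where each verb except pouvoir itself is
    replaced by its nearest neighbour, preserving the pouvoir label."""
    return [most_similar(tok)
            if is_verb(tok) and tok.lower() not in POUVOIR_FORMS
            else tok
            for tok in tokens]

# stand-ins: one known verb substitution, identity otherwise
substitutions = {"répéter": "redire"}
result = augment(
    ["tu", "peux", "répéter"],
    is_verb=lambda t: t in {"peux", "répéter"},
    most_similar=lambda t: substitutions.get(t, t),
)
print(result)  # ['tu', 'peux', 'redire']
```

<p>Note that "peux" is detected as a verb but kept intact, since replacing the target verb itself would invalidate the annotation.</p>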
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments and results</title>
      <sec id="sec-4-1">
        <title>5.1. Training Data selection</title>
        <p>In our experiments, the primary objective was to identify the most effective configurations regarding training data and model selection for the token classification of the French modal verb pouvoir. We chose to perform token classification to isolate occurrences of pouvoir, enabling us to label them with the specific categories we developed.</p>
        <p>The primary evaluation metric used across these tests was the F1-score, which harmonically combines precision and recall. This metric is particularly crucial in scenarios such as ours where class imbalance is significant: over 97% of the dataset constituted the non-pouvoir class, labeled "O". This label was used to mark all tokens that did not correspond to instances of pouvoir, allowing the model to focus specifically on identifying and classifying the modality of pouvoir’s occurrences.</p>
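<p>As a concrete illustration of this token-classification framing, the sketch below labels every token "O" except occurrences of pouvoir (the list of forms is illustrative, not exhaustive) and computes a per-class F1, so that the "O" class can be excluded from evaluation as described above:</p>

```python
# Minimal sketch of the labelling scheme and per-class F1 (assumed helper
# names, not the authors' code): every token is "O" except pouvoir forms,
# which receive the annotated modal category.
POUVOIR_FORMS = {"pouvoir", "peux", "peut", "pouvez", "peuvent",
                 "pouvait", "pourrait", "pu"}  # illustrative subset

def label_tokens(tokens, category):
    """Assign `category` to pouvoir tokens and "O" to all others."""
    return [category if t.lower() in POUVOIR_FORMS else "O" for t in tokens]

def f1_per_class(gold, pred, label):
    """F1-score of one label over aligned gold/predicted sequences."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

tokens = "tu peux répéter".split()
print(label_tokens(tokens, "permission"))  # ['O', 'permission', 'O']
```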
        <p>Initially, the corpora listed in section 4.2 were experimented upon using the camembert-base model with a stratified train-validation-test split of 80-10-10 over seven epochs in order to determine the most effective training data. This split allowed us to monitor model performance on a small validation set during training, and the augmented context corpus (corpus_context_augmented) proved to be superior, achieving an F1-score of 0.90 in evaluation and of 0.88 when the "O" class was excluded. These results indicated that data balancing coupled with contextual enhancements significantly benefits model performance. After identifying the corpus_context_augmented dataset as the optimal choice, we applied a 5-fold cross-validation strategy to evaluate the model’s robustness. This cross-validation process was conducted on the 80% training portion of the dataset, while the 20% test set remained untouched. Cross-validation yielded further improvements in model performance, solidifying the combination of the corpus_context_augmented dataset and the camembert-base model as our most reliable setup.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Model performance comparison</title>
        <p>After determining the optimal training data setup, we tested various pre-trained models to assess their effectiveness in the modal classification of the French verb pouvoir. Throughout this phase, we maintained the stratified 80-20 split for training and testing, ensuring that the 20% test set remained unseen for final evaluations. For all models tested, the training set was subjected to 5-fold cross-validation during training to leverage its demonstrated benefits. As shown in table 3, the best performing model was flaubert-base-cased, which achieved an F1-score of 0.94, and of 0.92 when the "O" class was excluded (the fine-tuned model is available at https://huggingface.co/DiegoRossini/flaubert-pouvoir-modality-detector). One possible reason for its superior performance could be the extensive and diverse pretraining corpus it was trained on, which is specifically designed to capture various nuances of the French language. Given that our dataset is based on oral corpora, the flaubert-base-cased model may be particularly well suited for this type of data, as the other models have been trained on less diversified data forms. In the final evaluations, the flaubert-base-cased model demonstrated strong performance in identifying non-modal occurrences and in distinguishing specific modalities such as "eventuality" and "permission" (see the confusion matrix and the results per category in appendix B). However, it encountered some challenges with the "material possibility or ability" category, indicating slight semantic overlaps. The confusion matrix corroborates these findings, showing minimal misclassifications, concentrated around the "material possibility or ability" category. This final analysis highlights that holistic advancements in both model selection and detailed category definition refinement are crucial. By leveraging models optimized for the French language such as FlauBERT, alongside meticulously curated and balanced training data, the task of modality classification for pouvoir can be approached with increasingly nuanced understanding and precision, promising further enhancements and consistency in future NLP applications of the same kind.</p>
      </sec>
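<p>The stratified splitting used throughout these experiments can be sketched as follows. In practice one would typically use scikit-learn’s StratifiedKFold; this is a minimal pure-Python stand-in that preserves per-class proportions across folds:</p>

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Split example indices into k folds, preserving per-class
    proportions (a minimal stand-in for sklearn's StratifiedKFold)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # deal each class's examples round-robin across the folds
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds

# toy labels mimicking the imbalanced modal categories
labels = (["material possibility"] * 50
          + ["permission"] * 25
          + ["eventuality"] * 10)
folds = stratified_folds(labels, k=5)
print([len(f) for f in folds])  # each fold: 10 + 5 + 2 = 17 examples
```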
      <sec id="sec-4-3">
        <p>For RoBERTa see https://huggingface.co/FacebookAI; for DistilBERT see https://huggingface.co/distilbert; for CamemBERT see https://huggingface.co/almanach; for FlauBERT see https://huggingface.co/flaubert; for BERT-base-multilingual see https://huggingface.co/google-bert.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Discussion</title>
      <sec id="sec-5-1">
        <p>The semantic substitution process was particularly challenging due to the resource-intensive nature of available models such as FastText (https://fasttext.cc/) and the complexity of handling text derived from spoken language. Our approach involved using Spacy to capture verbs, determining the most semantically similar verbs with FastText, and then conjugating them to match the form of the original verbs. This sequence of operations proved extremely resource-demanding and difficult to implement. Additionally, Spacy and FastText both demonstrated significant difficulties with the French language, leading to several inconsistencies during lexical substitution. These findings underscore the need for more robust, language-specific tools to improve the accuracy and efficiency of semantic substitution in NLP tasks involving French, particularly with spoken text.</p>
        <p>If we take a closer look at the model’s results, we notice that “permission” is the second best classified category, with an F1-score of 0.95. However, a qualitative analysis of the classified sentences revealed some incongruences. Among the various uses of pouvoir with the value of permission, there are two that are very frequent (40% of the permission annotations) and have a typical structure. These are the “pouvoir of politeness” (see Ex. 1.), a question that allows the subject to express a request politely, and the expression “je/nous/on” (I/we/impersonal pronoun “on”) + “pouvoir” + “dire” (to say), called “pouvoir_dire” (see Ex. 2.).</p>
        <p>(1) Euh attends j’ai un train de retard tu peux répéter ? (Uh, wait, I’m a bit behind, can you repeat that?) (ESLO2_ENTJEUN_1235)</p>
        <p>(2) Enfin j’ai fait essentiellement des mesures on peut dire (Well, I mostly took measurements, one could say [...]) (ESLO2_ENT_1014)</p>
        <p>Our model is biased by the fact that most instances of permission pouvoir follow one of these two patterns, which are characterized by a fixed structure: the model is not able to identify as pouvoir of permission any use that differs from 1. or 2.</p>
        <p>(3) Je suis nommé par le siège qui peut du jour au lendemain si je ne fais pas le travail me me basculer. (I am appointed by headquarters, which can, from one day to the next, if I don’t do the job, toss me out.) (ESLO1_INTPERS_438)</p>
        <p>For example, the model classifies Example 3. as “possibilité matérielle et capacité” even though the institution (i.e., “headquarters”) granting permission to the subject is clearly mentioned. To address this problem, it would be necessary to enrich and to vary, in terms of structures, the examples in the deontic category. Finally, we tested our model on all the 221 interviews in the ES_CF corpus. The results show that most instances of pouvoir belong to the category of material possibility or ability (51% of pouvoir instances), followed by permission (35%), eventuality (9%) and sporadicity (5%). In general, the most representative modal category is the alethic one (values of material possibility and ability plus sporadicity: 56%). These results are consistent with those we obtained in the manually annotated portion of the ES_CF corpus presented in section 4.1.</p>
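<p>The qualitative error analysis above, like the confusion matrix in appendix B, boils down to counting (gold, predicted) category pairs over aligned label sequences. A minimal sketch with toy labels (not our evaluation data):</p>

```python
from collections import Counter

def confusion_matrix(gold, pred):
    """Count (gold, predicted) category pairs over aligned label lists."""
    return Counter(zip(gold, pred))

# toy labels: one permission instance misclassified, as in Example 3.
gold = ["permission", "permission",
        "material possibility or ability", "eventuality"]
pred = ["permission", "material possibility or ability",
        "material possibility or ability", "eventuality"]
cm = confusion_matrix(gold, pred)
print(cm[("permission", "material possibility or ability")])  # → 1
```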
      </sec>
      <sec id="sec-5-2">
        <title>7. Conclusion</title>
        <p>This study demonstrates significant first progress in the automatic classification of the French verb pouvoir by finding the best fine-tuned BERT model. Moderate to substantial inter-annotator agreement led to merging some subcategories for more streamlined annotations.</p>
        <p>The flaubert-base-cased model, with contextual data augmentation, achieved an impressive F1-score of 0.94 with cross-validation, highlighting the importance of context (see section 4.2, “Corpus Context”). However, challenges persist, such as limited training data and the need for better annotation tools and more powerful computational resources. The model struggles with certain deontic usages that humans easily identify. Intentional ambiguity by the speaker also poses a challenge for both annotators and the model. Future work should expand and enrich the dataset and consider training on full texts instead of isolated sentences to capture context better. [8] propose a similar approach, emphasizing the importance of taking a large context around the target token and advocating for the use of full texts as context. In the future, we will also experiment with an augmented context window of 10 lines before and after the target token. These enhancements will improve model robustness and set the stage for further advancements in natural language processing, particularly for classifying the semantic values of French modal verbs. This is the first step in a larger project that will soon include the verb devoir (must). More globally, the ultimate goal of our approach is to be able to identify which modal categories are prevalent in any given corpus [16]. Indeed, given that the verb pouvoir is present in all types of texts, the ability to identify its modality becomes a necessary tool for refining the overall analysis of modality in different tasks such as sentiment analysis [17] or hedge detection [18].</p>
        <p>A. Annexe A: Extended version of annotation examples of the 7 semantic values of pouvoir</p>
        <p>Parfois dramatique comme les les romans qui peuvent rappeler des situations plus ou moins pénibles. (Sometimes dramatic, like novels that can evoke more or less painful situations) (ESLO1_ENT_003_C)</p>
        <p>C’est un un personnage donc il y a des choses que vous ne pouvez pas faire uniquement avec du verre et du plomb par exemple ces cheveux-là le nez la bouche oui. (It is a character, so there are things you cannot do with just glass and lead, for example, the hair, the nose, the mouth, yes.) (ESLO1_ENT_002_C)</p>
        <p>À l’intérieur on a une galette on a un gâteau on le partage en X morceaux on peut pas le le faire grandir par le le un coup de baguette magique. (Inside, we have a cake, we share it into X pieces, we cannot make it grow with a wave of a magic wand.) (ESLO1_INTPERS_421_C)</p>
        <p>ø</p>
        <p>Les payer pour qu’ils euh fassent leur boulot et euh qu’on donne un un prix euh au meilleur grapheur money price et on prend cinq mille euros ça pourrait être pas mal. (Pay them so they, uh, do their job and, uh, give a, uh, prize, uh, to the best graffiti artist, money prize, and we take five thousand euros, that could be nice) (ESLO2_ENTJEUN_1228_C)</p>
        <p>Euh les gens sont libres de venir consulter quelque médecin que ce soit et ils peuvent en changer à tout moment et que donc euh après être venus me consulter euh si je ne leur plais pas. (Uh, people are free to consult any doctor they choose and they can change at any time, and so, uh, after coming to see me, uh, if they don’t like me.) (ESLO1_ENT_003_C)</p>
        <p>C’est ça ? justement je me dis comment est-ce que je vais pouvoir utiliser mes capacités informatiques ? (That’s it? Exactly, I’m wondering how I will be able to use my computer skills?) (ESLO2_ENTJEUN_1235_C)</p>
        <p>Parce que sinon on aurait pu ... (Otherwise, we could have...) (CFPP, Catherine_Lecuyer)</p>
        <p>B. Annexe B: confusion matrix of the best model’s results</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>C. Annexe C: Datasets</title>
      <p>Corpus_Base (1 example = 1 oral speech turn); Corpus_Base_Augmented (from a Corpus Base example, another is created by performing lexical substitution); Corpus_Context (1 example = 1 oral speech turn + the oral speech turn before and the oral speech turn after); Corpus_Context_Augmented (from a Corpus Context example, another is created by performing lexical substitution).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>