
Profiling Anonymous Authors in the Corsican Autonomist Press of the Interwar Period

Vincent Sarbach-Pulicani

Université Côte d'Azur, Centre de la Méditerranée Moderne et Contemporaine, Campus Carlone, 06100 Nice, France

2023

78 99

With the emergence of nationalism in the 19th century came regionalist movements to assert and claim cultural particularities. Corsica fitted very well within this dynamic and even presented itself as a favourable location for the development of such ideas. The centralization of the state around a strong capital and the policies of assimilation of the indigenous populations on the border with France led certain players to defend these particularisms. It was in this context that the Corsican autonomist newspaper A Muvra was born in May 1920 in Paris, under the impetus of Petru and Matteu Rocca. For almost 19 years, hundreds of authors participated in the writing of this massive dialectal work. This paper presents the results of research that aimed to carry out author profiling, i.e., to determine the style and subjects covered by an author. The goals of this study were to determine the identity behind certain authors and also to highlight the role pseudonyms played in the newspaper's propaganda. We conducted authorship attribution to achieve the first objective before completing these analyses with topic modelling in order to meet the second one.

Keywords: stylometry, topic modelling, Corsican studies, under-resourced languages, computational history

1. Introduction

There has been a desire in Corsica to structure and study the evolution of the use of the Corsican language. We can notably mention the work of the linguist Marie-José Dalbera-Stefanaggi with her Nouvel atlas linguistique et ethnographique de la Corse. In the republications of this major work in the 2000s, the author incorporated her work on the creation of the Banque de Données Langue Corse (BDLC).¹ This is the first initiative to lemmatise the Corsican language in its diachrony and diatopy.² Since the second half of the 2010s, there has been a significant increase in scholarly reflection on equipping regional languages with NLP tools [16]. Our approach is fully in line with this state of the art. This paper is the continuation of a master’s thesis written as part of a double degree programme between the École nationale des chartes of Paris and the Università di Pisa [24]. It follows a first thesis that highlighted the major ideological differences between the corsists and the irredentists, despite their obvious proximity [25]. It resulted in the creation of a database named Autonomists/Irredentists Database (A/I database).³ This work establishes that although the corsists admitted to being part of a common cultural and linguistic entity with Italy, they did not share the same desire for political unification, even if some autonomists came closer to Fascist ideas just before the beginning of the Second World War.

Like any political press, A Muvra has a large number of anonymous authors writing under pseudonyms. While some pseudonyms may belong to individual authors in their own right, there is also a good chance that others belong to recurring authors of the journal who also publish under their real names. This raises a number of questions about the identity of these anonymous authors as well as the role that a corsist gives to one or more of his pseudonyms. Several preliminary hypotheses can be proposed at this stage, including the deliberate exaggeration of the number of activists, the need for protection against censorship, or the desire to express varying viewpoints. In order to address these questions, we employ two distinct analytical methods. First, we use stylometry to unveil the identities of anonymous authors; secondly, we apply topic modelling to gain insights into the themes associated with these pseudonyms. Subsequently, we engage in an interpretive phase to discern the purpose and characterization an author assigns to their pseudonym. These two layers of analysis together make up the author profiling discussed above. The analyses and results are all available on a GitHub repository dedicated to this research [26].

2. Datasets construction: starting from scratch

2.1. The OCR processing

The main issue surrounding the analysis of such a review is the accessibility of the data. In order to carry out the analyses, the data had to be acquired from the digitised images of the newspapers. Segmentation and OCR presented significant challenges, as did postprocessing and normalisation (see an example of a front page in Figure 12). We were able to locate two online platforms where our documents are available for download. The images come from two sources: the Bibliothèque nationale de France (BnF) and the Archives départementales de Corse du Sud (ADC). We therefore used Gallica, the digitization platform of the BnF, and THOT, the platform of the ADC.⁴ The fact that these are public institutions means that the digitizations are in the public domain, i.e., freely reusable. After the web scraping phase, we obtained a collection of 375 issues of the Muvra, i.e., approximately 1500 pages from 1921 to 1931.

¹ https://bdlc.univ-corse.fr/bdlc/corse.php
² This database, which includes a wide range of possibilities, was created on the basis of a vast and particularly impressive field survey.
³ https://heurist.huma-num.fr/heurist/?db=vsp_presse_corsiste_irredentiste

One of the problems with having images from two different sources is the quality of the images. This raises the question of whether or not it is appropriate to normalise and clean images in order to facilitate OCR processing. The original idea of our research was to clean the documents using binarization with the Otsu method [19] followed by a despeckling phase. “Speckling” is a type of noise that corresponds to random clusters of black pixels that impair the intrinsic quality of a binarized image [12]. However, the quality of the digitizations, especially from the Archives départementales, varies greatly. Sharpness is not the main problem; it is more a question of stains on the paper or pages damaged by time. This is an inherent problem in the conservation of old newspapers: the paper is cheap and not made to last. The conservation of these documents is therefore difficult, and this is reflected in the quality of the digitization. Standardising all the images at once would require first sorting them into groups with identical layouts, for a gain in OCR quality that is not guaranteed. We therefore decided to favour quantity over quality, even if this meant that normalisation would be carried out on the raw data.
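To make the cleaning step that was initially considered more concrete, here is a minimal sketch of Otsu binarization followed by a naive despeckling pass, written with the standard library only. It is an illustration under stated assumptions, not the processing applied to the corpus: the function names (`otsu_threshold`, `binarize`, `despeckle`) and the `min_size` parameter are hypothetical, and images are assumed to be lists of rows of 8-bit grey levels.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark, sum_dark = 0, 0
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0:
            continue
        if n_light == 0:
            break
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        between = n_dark * n_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(image, threshold):
    """Map every pixel to black (0) or white (255)."""
    return [[0 if p <= threshold else 255 for p in row] for row in image]


def despeckle(image, min_size=3):
    """Whiten 4-connected clusters of black pixels smaller than min_size (hypothetical rule)."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if out[y][x] != 0 or seen[y][x]:
                continue
            # Collect one connected component of black pixels.
            stack, component = [(y, x)], []
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and out[ny][nx] == 0 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(component) < min_size:
                for cy, cx in component:
                    out[cy][cx] = 255
    return out
```

On stained or damaged pages, a single global threshold of this kind may well turn stains into ink, which is consistent with the decision not to rely on image cleaning.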

One of the major challenges in the world of automatic character recognition today is the segmentation of newspapers. Their complex layout requires the training of complex models that are often specific to a type of newspaper. We decided to train a Kraken segmentation model from the XML files in ALTO format available on Gallica, with the help of the eScriptorium platform [18] and the ketos module. Once the ALTO files had been converted to the right format, we could train the model to segment the images coming from the Archives départementales de Corse du Sud. In order to improve the model, it was necessary to use the tool YALTAi [8] developed by Thibault Clérice, which allows YOLOv5 [13], an Ultralytics object detection model, to be adapted for training segmentation models with Kraken. For the text recognition phase, we opted for Tesseract-OCR, which includes a Corsican model. We needed to create UZN files readable by this engine in order to follow the coordinates of the image (Figure 1).
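The last step, turning segmentation output into zone files, can be sketched as follows. This is a hedged illustration only: it assumes the commonly described UZN convention of one zone per line in the form `left top width height label`, and the function name `boxes_to_uzn`, the `(x0, y0, x1, y1)` box format and the default label are hypothetical choices, not taken from the project's scripts.

```python
def boxes_to_uzn(boxes, label="Text"):
    """Serialise bounding boxes as UZN lines.

    boxes: iterable of (x0, y0, x1, y1) pixel coordinates (assumed format).
    Assumption: each UZN line reads "left top width height label".
    """
    lines = []
    for x0, y0, x1, y1 in boxes:
        width, height = x1 - x0, y1 - y0
        lines.append(f"{x0} {y0} {width} {height} {label}")
    return "\n".join(lines) + "\n"


# Example: two hypothetical text regions on a newspaper page.
uzn_text = boxes_to_uzn([(10, 20, 110, 220), (130, 20, 230, 220)])
```

The resulting text would be saved next to the image under the same base name with a `.uzn` extension, so that the engine can restrict recognition to those zones.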

2.2. Data standardisation

Once we had our raw textual data, we had to classify them according to their language, typology, and author. Then we could perform the cleaning of the textual data, carried out in four main stages:

• The removal of punctuation
• Case reduction
• Normalization of syntax
• Elimination of accents
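The three mechanical stages of this list (punctuation, case, accents) can be sketched with the standard library alone. The names `strip_accents` and `clean` are hypothetical and do not come from the project's cleaning script; the normalisation of syntax is deliberately left out here because it depends on context.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean(text):
    """Lowercase, strip accents, replace punctuation with spaces, collapse whitespace."""
    text = text.lower()
    text = strip_accents(text)
    # Apostrophes (straight or curly) become spaces, so elided forms are split.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())
```

Replacing punctuation with a space rather than deleting it keeps elided forms apart, which matters for a language in which apostrophes are frequent.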

The most delicate phase in our methodology is the normalisation of the syntax. It is important because, for euphonic reasons, contractions occur in written form as elisions, which reflect the discourse practices of speakers of Corsican. For example, the expression s’è ellu hè (“if he is”) becomes s’ell’è in writing. Conversely, restoring the original form of the elision requires taking into account the context of gender and number: ell’ can give ellu, ella, elli, or elle. There is also the question of the normalisation rule: should we base ourselves on the syntax of the 20th century or on the current one? Moreover, a certain number of ambiguities can creep into such a correction, such as the word e, which, depending on the context, can mean either “the” or “and”. We should not forget that Corsican is a “langue par élaboration” or Ausbau language and that, consequently, its syntax is complicated by the distinct realisations found from one author to another. In sociolinguistics, this type of language is a variant of a structured language (such as Italian) that has been set up as a distinct elaborated language [29].

⁴ https://gallica.bnf.fr/accueil/en/content/accueil-en?mode=desktop | http://archives.isula.corsica/Internet_THOT/FrmSommaireFrame.asp

The issue of data normalisation is particularly delicate due to the very nature of our methodology. While topic modelling excludes function words from the analysis because they carry little meaning, stylometry relies mainly on the most frequent words of all types. To what extent, then, should we normalise the data? Do we lose information if we normalise the syntax of certain terms, or do we gain information? The choices that have been made are recorded in the Python file dedicated to data cleaning. This is nevertheless an important bias for our analysis. Fortunately, the regiolectal diversity of the Corsican language means that the idiomatic features of the authors are characterised by the great variety of the function words used. A thorough normalisation should not alter our analysis too much, even if it remains an avenue of improvement for our study.

In the end, we obtained a total of 3 corpora of different sizes with a total of almost 1.5 million words (Table 1), with approximately 56.7% of the words in Corsican, 27% in French, and 16.3% in Italian. The main point of improvement in this method of extracting textual data is the balancing of the corpus. While we were able to obtain almost all the articles in the issues on Gallica, the issues on THOT were selected according to our needs, given the variation in the quality of the images. However, this is still quite sufficient for the type of analysis we are carrying out, focusing on a certain number of authors. Details of the samples selected for this study can be found in the appendix (Table 4).

3. Method proposed: two layers of analysis

The latest advances in stylometry have been made with the use of machine learning algorithms. Recent examples include the work of Jean-Baptiste Camps and Florian Cafiero, who used SVM classifier algorithms to identify the authors of the American conspiracy forum QAnon [6]. This means that we can now tackle the question of the statistical units to be analysed with our algorithms, whether using machine learning techniques or distance metrics. The two French researchers chose to work on character 3-grams because of the “increase robustness”: they are “known to reduce sparsity and perform well in attribution studies”. In reality, the features to be analysed vary according to the nature of the corpus and the quality of the data. One example is the measurement of verses in poetic works to measure an author’s style [3], and even the rhymes in mediaeval texts, as Mike Kestemont did in 2012 [15]. For the previous thesis, we compared the results obtained with the SVM with a distance metric, the Delta score as defined by John Burrows in 2002 [4], in order to confirm them, considering the limited length of the corpora. The objective of this double layer of analysis was to confirm the results and determine the best possible approach for our corpus. This paper will focus on the machine learning approach, but the results obtained with Burrows’s Delta, which confirmed the SVM results, are available on the GitHub repository [26]. The script used is SuperStyl, developed by Jean-Baptiste Camps in 2021 [7]. Whatever the authors and pseudonyms tested, we excluded poetic texts in prose or verse from the stylometric part due to their specificities.
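As a reminder of what the distance-based check measures, the sketch below computes Burrows’s Delta in its classic form: the mean absolute difference between the z-scores of the most frequent words in two texts. It is a minimal illustration, not the SuperStyl implementation; the function names, the `n_mfw` parameter and the toy tokens are hypothetical, and the population standard deviation is used here only so that the example also works with very few candidates.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(tokens, vocabulary):
    """Relative frequency of each vocabulary word in a list of tokens."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def burrows_delta(known_texts, disputed, n_mfw=3):
    """Return {author: Delta} between each known author and the disputed text.

    known_texts: dict mapping an author name to a list of tokens.
    disputed: list of tokens of the text to attribute.
    """
    # Most frequent words over the whole reference corpus.
    pooled = Counter(tok for toks in known_texts.values() for tok in toks)
    vocabulary = [word for word, _ in pooled.most_common(n_mfw)]

    profiles = {a: relative_freqs(t, vocabulary) for a, t in known_texts.items()}
    columns = list(zip(*profiles.values()))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]  # guard against zero spread

    def z_scores(freqs):
        return [(f - m) / s for f, m, s in zip(freqs, mus, sigmas)]

    z_disputed = z_scores(relative_freqs(disputed, vocabulary))
    return {
        author: mean(abs(x - y) for x, y in zip(z_scores(profile), z_disputed))
        for author, profile in profiles.items()
    }


# Toy example with invented token lists: the lowest Delta is the closest style.
known = {"A": "e e e e u di".split(), "B": "di di di di u e".split()}
deltas = burrows_delta(known, "e e e u di e".split())
closest = min(deltas, key=deltas.get)
```

A lower Delta means a closer stylistic profile, so the candidate with the smallest value is the one the metric favours.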

It is very important to vary the hyperparameters available to us in order to optimise machine learning. To do this, the SuperStyl algorithms allow us great flexibility in the options to be taken into account. After various tests presented in the benchmark (Table 6), we chose the following parameters: the statistical units are the most frequent words; we apply PCA (Principal Component Analysis) for dimensionality reduction; the cross-validation is carried out with the “Leave-One-Out” method; and we balance the dataset with “upsampling”. This technique consists of resampling the minority classes until they contain as many examples as the majority class, as explained by Joseph Barr in 2022 [2]. Once the model has been trained, we apply it to the unseen data. In view of the large number of candidates for the second experiment, we initially subdivided them into two groups in order to obtain more precise results before carrying out an analysis on the whole corpus.
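The balancing and validation choices described above can be illustrated with a short standard-library sketch. It shows the simplest variant of upsampling, random resampling with replacement, together with a leave-one-out index generator; it is an assumption-laden illustration rather than the SuperStyl code, and the names `upsample`, `leave_one_out` and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def upsample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until every class matches the largest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


def leave_one_out(n_samples):
    """Yield (train_indices, test_index) pairs, holding out one sample at a time."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, held_out
```

When the two are combined, resampling is normally applied to the training indices of each fold only, so that a duplicate of the held-out text never ends up in the training set.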

Concerning topic modelling, LDA (Latent Dirichlet Allocation) is a method based on a term-document matrix. It rests on the assumption that “documents are represented as random mixtures of latent topics, where each topic is characterised by a distribution of words”. LSI (Latent Semantic Indexing), on the other hand, consists of creating a semantic space based on a corpus in which similarities between words or documents are calculated on a statistical scale. Each of these methods has its own advantages and disadvantages that need to be taken into account, hence the importance of the notion of comparability inherent in our study [9]. In 2020, a group of researchers set out to compare the two methods by training them on a corpus of BBC articles [14]. The results of their research revealed that LSI is more effective when dealing with a large amount of data and fewer iterations than LDA, while the latter is more suitable for smaller corpora. The idea here is to present the most interesting results, with an empirical observation of the results obtained as a form of intrinsic evaluation. In the long term, implementing more effective evaluation metrics such as coherence would be very relevant, even if it is not necessary in our case, given that we are modelling general themes rather than assigning a label to each article. To do so, we used the Gensim package for Python, which offers wide possibilities for performing both LSI and LDA techniques. The different experiments presented in the appendix, along with the hyperparameters and methods used, are detailed in the summary table (Table 5). Table 8 serves as a glossary containing pertinent words that were modelled in the course of the experiments.
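Both methods start from the same object, a term-document matrix. The sketch below builds one with the standard library so that its shape is explicit; it does not reproduce the Gensim pipeline used for the experiments, and the function name, the whitespace tokenisation and the toy documents are assumptions made for illustration.

```python
from collections import Counter


def term_document_matrix(documents, stopwords=frozenset()):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    tokenised = [
        [tok for tok in doc.split() if tok not in stopwords] for doc in documents
    ]
    vocabulary = sorted({tok for doc in tokenised for tok in doc})
    counts = [Counter(doc) for doc in tokenised]
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


# Toy documents (invented), already cleaned and lowercased.
docs = ["a lingua di u populu", "u populu di corsica corsica"]
vocab, tdm = term_document_matrix(docs, stopwords={"a", "di", "u"})
```

LSI factorises a (usually weighted) version of this matrix, whereas LDA treats each column as a bag of words generated by a mixture of topics.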

The vocabulary plays an essential role in topic modelling. The words taken into account must not be too numerous, as training the model can be extremely time-consuming. The number of documents and the vocabulary chosen will therefore play a central role among the various biases to be applied. Unlike in stylometry, function words are of no interest because they are considered to be empty words, i.e., words that have no significant meaning of their own but serve to add details to the sentence [1]. We had to create a specific list of stopwords for our Corsican corpus (Table 7) due to the absence of a basic language toolkit [17]. The list creation process occurred in two phases. Initially, it involved comparison with an Italian list that contained stopwords overlapping with Corsican. Following that, it consisted of the examination of various corpora, including the Muvra dataset. This examination led to the identification of the most frequent words, from which the stopwords were then selected. The idea is therefore to remove them in order to reduce the vocabulary. But there is also the case of hapax or infrequent words, as well as frequent words that are not stopwords, such as “corsica” in this case. One solution is to include the notion of statistical entropy in the choice of vocabulary, as presented by Susan Dumais in a 1992 article [10], with the following formula:

In this equation, ndocs represents the number of documents, tf is the frequency of the term i in the document j, and gf is the overall frequency of the term i. The idea is to calculate the entropy of each word in the corpus and to select vocabulary within a defined interval.
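As an illustration of this selection step, the sketch below uses one widely cited form of the entropy weight from the LSI literature, 1 + Σ_j p_ij log(p_ij) / log(ndocs) with p_ij = tf_ij / gf_i. That this is the exact expression intended here is an assumption, as are the function names and the interval bounds. With this form, a word spread evenly over all documents scores 0 and a word confined to a single document scores 1.

```python
import math


def entropy_weight(term_counts):
    """Entropy weight of one term, given its frequency in each document (tf_ij).

    Assumed form: 1 + sum_j p_ij * log(p_ij) / log(ndocs), with p_ij = tf_ij / gf_i.
    """
    ndocs = len(term_counts)
    gf = sum(term_counts)  # overall frequency of the term
    if gf == 0 or ndocs < 2:
        return 0.0
    total = 0.0
    for tf in term_counts:
        if tf > 0:
            p = tf / gf
            total += p * math.log(p)
    return 1.0 + total / math.log(ndocs)


def select_vocabulary(term_rows, low, high):
    """Keep the terms whose entropy weight falls within [low, high] (hypothetical bounds)."""
    return sorted(
        term for term, counts in term_rows.items()
        if low <= entropy_weight(counts) <= high
    )


# Invented counts over four documents.
rows = {"di": [5, 5, 5, 5], "corsica": [4, 3, 0, 1], "nasitortu": [1, 0, 0, 0]}
kept = select_vocabulary(rows, low=0.1, high=0.9)
```

In this toy example the evenly spread function word and the hapax both fall outside the interval, which is the behaviour the vocabulary selection is looking for.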

4. Results

4.1. The two pseudonyms chosen

The aim is to test our methodology on two different pseudonyms. The first, P. di B., allows us to check the reliability of our tools on a relatively small corpus in Corsican by confirming the identity of the author. The second, Altore, gives us the opportunity to test these tools on a completely unknown author, leaving us free to interpret and choose the candidates.

The pseudonym P. di B. is a name that appears fairly regularly in the writings of the Muvra. A number of articles were published under this pseudonym, and it is generally accepted that it is actually Petru Rocca, as mentioned by Carmine Starace in the pages of his Bibliografia della Corsica [28]. This pseudonym is believed to be the initials of his mother’s surname, Maria Saveria Rocca-Pozzo di Borgo. The latter had remained very close to her sons Petru and Matteu, even publishing drawings in the Muvra. Confirming the writings of contemporary actors from this period also makes it possible to verify the rigour of their anthological work. It is also an excellent way of testing our methodology in a more or less reliable setting.

The other pseudonym examined in this paper is Altore. It is directly inspired by the lake of the same name in the Asco valley, in the old Caccia pieve within the region of the same name. Altore is the author of Lettere aiaccine, the letters from Ajaccio, which often appeared on the front page of the newspaper. In this format, he covers all the subjects of society and politics in general in an open, family-friendly letter format. Our corpus contains 62 of these letters, all written in the Corsican language. The difficulty with this part of our study is that we have no information or clues about the real author behind this pseudonym. Nevertheless, its presence on the front pages of many issues at least testifies to the importance attached to this particular section and therefore to its author.

Concerning the candidates, apart from Petru Rocca, who seems obvious to include in the analysis given the information we provided earlier, we decided to choose two other potential authors. The first is Martinu Appinzapalu, a pseudonym of the Corsican priest Dumenicu Carlotti and symbol of the religious aspect of the islanders’ autonomist struggle at the time, who published numerous articles throughout the paper’s existence and was part of the Partitu Corsu d’Azione, the political party attached to the Muvra. The second is Marcellu Alessandri di Chidazzu, one of the authors most involved in the writing and a fervent defender of the irredentist cause.

4.2. First experiment: P. di B.

The evaluation of the trained model is presented in Table 2; we obtained an accuracy of 0.95. We then obtain a file with the predictions of the author of the articles and the results of the decision function that “tells us how close each sample is to the hyperplane separating each class” [5]. A negative value means that the sample falls outside the candidate’s class; a positive value means that it falls inside. The higher the score, the more strongly the classifier attributes this sample to the candidate. By applying this function to our study, we get Figure 2. We have also added the identifiers of articles written by P. di B. whose authorship has not been attributed to Petru Rocca. On the whole, however, most of the articles were attributed to Petru Rocca. Of the 34 articles in the test corpus, 26 are attributed to the director of the Muvra, i.e., 76% of them. But what is even more interesting to study is the behaviour of the curves on the decision function graph. On average, the decision function scores are much higher for Petru Rocca’s texts.
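The reading of the decision function given above can be made concrete with a small sketch. It assumes a linear kernel, for which the score of a sample is w·x + b, and uses invented weights and scores; the function names are hypothetical and nothing here reproduces the trained model.

```python
def decision_function(weights, bias, features):
    """Signed score of a sample for a linear SVM: positive means inside the class."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def attribution_rate(scores):
    """Share of samples whose score is positive, i.e. attributed to the candidate."""
    attributed = sum(1 for score in scores if score > 0)
    return attributed / len(scores)


# Invented weights and features, for illustration only.
score = decision_function([1.0, -2.0], 0.5, [3.0, 1.0])

# 26 positive scores out of 34 test articles, as in the first experiment.
rate = attribution_rate([1.0] * 26 + [-1.0] * 8)
```

With 26 positive scores out of 34, the rate is 26/34, which rounds to the 76% reported for the test corpus.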

Petru Rocca is an expert in the use of pseudonyms, as nearly five different identities are attributed to him in the various anthologies and studies carried out on him. We find his signature, Petru Rocca or Pierre Rocca, and the pseudonyms Pasquale Manfredi, P. di B., and P. di C. In view of the stylometric results, we can assume that these various identities attributed to him are indeed his own. In order to optimise the performance of our topic models, several parameters need to be taken into account, such as the number k of topics, iterations, words, and passes. Petru Rocca writes mainly in Corsican, although he does leave an important place for French. He also writes a little in Italian, but there are too few texts to be relevant. Although we can reference 139 articles written by Rocca in total, we performed the LDA on sub-corpora according to language and pseudonym (Figures 4, 5, 6, 7, 8). It is important to note that, for reasons of data quantity, we have grouped together in the same sub-corpus the texts signed by Petru Rocca and Pierre Rocca, as well as the texts signed by P. di B. and P. di C. We assume that these have the same utility, but this is obviously a point to be improved in further analyses of the question.

The pseudonyms seem to allow Petru Rocca to evoke a wider spectrum of specific subjects that remain around political and cultural current affairs. Similarly, the use of language doesn’t seem to be part of any attempt to separate themes, with French and Corsican acting more as a complement to each other, even if the local dialect seems to be used more to address cultural notions. How then to explain the use of several pseudonyms to express himself in his own newspaper? Let’s not forget that he is in fact the director of the Muvra. This can be attributed to propaganda objectives. Indeed, even though there are a large number of contributors, there are very few who are really involved in the corsist struggle over the long term. For Rocca, it would be a question of inflating the number of contributors a little in order to make a more substantial core of regular authors appear. It’s not all ideology, and there are sometimes simpler justifications to understand the muvrists’ approach. This reason can also be seen in the public demonstrations organised by the autonomists. Thus, in 1934, a number of participants are mentioned in the sixth edition of the merendelle d’i pueti còrsi.⁵ The list includes Dumenicu Carlotti, Eugeniu Grimaldi, Petru Rocca, and a certain Pasquale Manfredi.

4.3. Second experiment: Altore

In the same way as we confirmed Petru Rocca’s authorship of the texts of P. di B., we carried out the stylometric analysis of those of Altore using the SVM classifier. For the candidates, we chose a wide range of possible authors among the most important ones in the Muvra. For this experiment, the first sub-group mentioned above was made up of Ghjanettu Notini, Victor Gianviti, Dumenicu Antone Versini and Marcellu Alessandri. The second was made up of Simon’Ghjuvanni Vinciguerra, Orsu Francescu Piazzoli, Petru Rocca and Dumenicu Carlotti.

[Table: evaluation of the model per candidate (Vinciguerra, Piazzoli, Alessandri, Versini, Carlotti, Rocca, Notini, Gianviti), with macro and weighted averages]

We thus obtain an accuracy of about 0.86 and a fairly good model, as seen in Table 3. This test bears witness to another important aspect of stylometry that has not yet really been addressed in this paper: the notion of corpus size as a function of the number of candidates. This echoes the article by Maciej Eder published in 2015 by Oxford University Press [11], where he stated that “the effectiveness of attribution depends on corpus size and particularly on the number of authors tested”.

The results of the decision function (Figure 3) show us that Ghjanettu Notini is the most likely candidate among the panel of candidates. But stylometry, like any computational method used in the field of digital humanities, also requires more in-depth research with “close reading”. Numbers are not proof. Ghjanettu Notini was born on December 4, 1890, in San Petru di Venacu, in the old pieve of Venacu in Corsica’s Curtinese region. Interestingly enough, this region of central Corsica is relatively close to Lake Altore. He was a Corsican poet and writer who contributed for many years to the Muvra under the pseudonym U Sampetracciu. Nicknamed the “Corsican Molière”, according to Ghjacumu Thiers, he was the founder of the Teatru corsu di A Muvra in the early years of the newspaper and a loyal contributor.

We can notice certain terms that come up frequently on the wordcloud that visualises the results of topic modelling on Altore (Figure 9), such as “corsu” or “corsica”. This brings us face-to-face with our vocabulary selection methodology. These words are very frequent but remain essential in the context of a Corsican autonomist newspaper. Nevertheless, certain trends stand out, with political issues omnipresent in the lettere aiaccine. In particular, there is mention of the French politician and industrialist Paul Lederlin, who was elected Senator for Corsica in 1930. For U Sampetracciu (Figures 10, 11), we see that the plays written by Ghjanettu Notini are particularly dominant in the detection of topics. This can be seen thanks to the large number of first names, typical of the theatrical style, which incorporates a lot of dialogue. Other elements highlight this, such as the presence of the onomatopoeia “Ah” or the term “scena” (scene). We can also observe the poetic dimension of Notini’s work with Topic 3 of the LDA: we find there the lexical field typical of Corsican poems, with the importance of the “mamma” (mother).

It appears that the Corsican author is more inclined to evoke political and topical themes with the pseudonym. He does this in a very particular literary style, that of the open letter, which corresponds quite well to Notini’s great talent for writing. However, Notini did not hesitate to raise these intrinsically political issues in his plays. Likewise, his poetry does not appear to be a simple ode to the beauty of Corsica but a complete reworking of the island’s poetic traditions through the prism of the lamentu, a poetic style cherished by the muvrists.

5. Further research

While this research shows promise, it is important to acknowledge its limitations, which are closely intertwined with its strengths. In the long run, it would be pertinent to develop a dedicated OCR model for recognising printed Corsican text. Additionally, exploring the possibility of fine-tuning the segmentation model to enhance its effectiveness holds significant potential. This article has highlighted the constraints of using topic modelling techniques, which may not be the most suitable approach for detecting word characteristics. Considering this, alternative methods like frequency-based analysis could be more appropriate, given our knowledge of the specific vocabulary found in the Muvra dataset. Moreover, the time invested in removing stopwords might have been unnecessary, as demonstrated by the experiments conducted by Alexandra Schofield and her colleagues [27]. Lastly, in terms of stylometric analysis, it is essential to conduct it on the entire newspaper corpus to validate the obtained results, and this should coincide with a more careful selection of candidates for analysis.

6. Conclusion

The dual nature of pseudonym usage can also be clarified by considering how it is employed. An identity can be used to evoke more sensitive subjects that we wouldn’t discuss without it. Ghjanettu Notini makes no secret of the fact that he is U Sampetracciu when he writes his plays and poetry. Even if he tackles specific political themes, he never goes too far and effectively protects himself from criticism behind his dramatic work. But it’s thanks to his hypothetical identity as Altore that Notini can really express his intentions, with more assertive political discourse and fewer filters. Conversely, the use of a pseudonym may not have a purely ideological role but a more propagandist one, as in the case of P. di B. for Petru Rocca.

Studying a weekly newspaper spanning almost 20 years represents a real technical challenge that forces us to make choices. Confronted with the intricacy of the numerous metadata within our dataset, we had to apply biases in order to obtain an overview of what computational methods can offer in the study of such a corpus. It would be possible to perform a stylometric analysis on all anonymous authors or topic modelling on every combination of articles, but this would be time-consuming; it remains a possible improvement to this research. In addition to determining the authorship of certain pseudonyms and the role of others, the question was also to work on an under-resourced language. The aim is to encourage this type of study in areas other than pure linguistics, as can be done at the Università di Corsica. While the complexity of the subject is a fact, it does not prevent us from obtaining coherent and promising results for the future. With better preparation of the data, as part of a broader project that would include more resources to allocate to the research, this subject has a lot of potential.

Acknowledgments

I would like to thank Jean-Baptiste Camps and Alessandro Lenci for their supervision of this research. Although this paper is the conclusion of a two-year dissertation, it is also the fruit of cooperation with several researchers, including Angelo Mario Del Grosso and Federico Boschetti, members of the CNR Pisa.

[Benchmark table: only the sampling column (Upsampling or Downsampling for each test) survives]

Glossary of pertinent words (Table 8)

Word            Language
affaire         French
affare          Corsican
ami             French
amore           Corsican
article         French
babbu           Corsican
barbare         French
bien            French
canta           Corsican
centrale        Corsican
chemin          French
concours        French
confrere        French
contre          French
core            Corsican
corse           French
corsu/a/e/i     Corsican
croce           Corsican
cumitatu        Corsican
cummissione     Corsican
cumpagnu        Corsican
cumpare         Corsican
directeur       French
droit           French
elettori        Corsican
esprit          French
fede            Corsican
federazione     Corsican
fonctionnaire   French
français        French
francese        Corsican
franchi         Corsican
francia         Corsican
fuir            French
gauche          French
giurnale        Corsican
gouvernement    French
guerra          Corsican
guvernu         Corsican
histoire        French
honneur         French
ile             French
isula           Corsican
italie          French
italien/ne      French
jente           Corsican
jeune           French
jornu           Corsican
jour            French
legge           Corsican
liberta         Corsican
lingua          Corsican

Word (continued, no language given): manu, marseglia, matrimoniu, megliu, merre, ministru, minuranze, moda, mondu, monsieur, nasitortu, naziunale, oghie, omu, paese, parigi, parti, passager, patrie, pays, poetes, politique, populu, postal/aux, presse, prete, prima, primavera, prisidente, prix, projet, prova, pueti, pulitica, raghione, razza, sangue, santu/a, scena, separatisti, sgio/scio, sicondu, stampa, statu, surete, teatru, temps, varghiolu, vergogna, vita, vitesse, vole

[1]

Arun ,

Suresh , and

C. V.

Madhavan . ““ Stopword graphs and authorship attribution in text corpora”” . In2:009 IEEE international conference on semantic computing. Ieee . 2009 , pp. 192 - 196 . doi: 10 .1109/icsc. 2009 . 101 .

[2]

J. R.

Barr ,

Sobel , and

Thatcher . “ “Upsampling, a comparative study with new ideas” . In: 2022 IEEE 16th International Conference on Semantic Computing (ICSC) . 2022 , pp. 318 - 321 . doi: 10 .1109/icsc52841. 2022 . 00059 .

[3] Beaudouin and Yvon. “Contribution de la métrique à la stylométrie”. In: Actes des 7èmes Journées Internationales d'Analyse Statistique des données textuelles (JADT). Vol. 1. 2004, pp. 107-118. url: https://imt.hal.science/file/index/docid/741596/filename/JADT_133_BeaudouinYvonDef20030116.pdf.

[4] Burrows. “‘Delta’: a measure of stylistic difference and a guide to likely authorship”. In: Literary and linguistic computing 17-3 (2002). doi: 10.1093/llc/17.3.267. url: https://academic.oup.com/dsh/article-abstract/17/3/267/929277.

[5] Cafiero and J.-B. Camps. “‘Psyché’ as a Rosetta Stone? Assessing Collaborative Authorship in the French 17th Century Theatre”. In: Proceedings of the Conference on Computational Humanities Research 2021. Vol. 2989. CEUR-WS. 2021, pp. 377-381. url: http://star.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-2989/long_paper51.pdf.

[6] Cafiero and J.-B. Camps. “Who could be behind QAnon? Authorship attribution with supervised machine-learning”. In: arXiv Cornell University abs/2303.02078 (2023). doi: 10.48550/arXiv.2303.02078.

[7] J.-B. Camps. SUPERvised STYLometry (SuperStyl). Version 0.9.0. 2021. url: https://github.com/SupervisedStylometry/SuperStyl/.

[8] Clérice and Chauhan. YALTAi, You Actually Look Twice At it. Version v0.0.1rc4. 2022. url: https://github.com/PonteIneptique/YALTAi.

[9] Cvitanic, Lee, H. I. Song, Fu, and Rosen. “LDA v. LSA: A comparison of two computational text analysis tools for the functional categorization of patents”. In: International Conference on Case-Based Reasoning. 2016. url: https://par.nsf.gov/biblio/10055536.

[10] S. Dumais. “Enhancing performance in latent semantic indexing (LSI) retrieval”. 1992. url: http://www2.denizyuret.com/ref/dumais/Enhancing_LSI___Dumais_1991.pdf.

[11] Eder. “Does size matter? Authorship attribution, small samples, big problem”. In: Digital Scholarship in the Humanities 30.2 (2015), pp. 167-182. doi: 10.1093/llc/fqt066. url: https://academic.oup.com/dsh/article-abstract/30/2/167/390738.

[12] Fracastoro, E. Magli, G. Poggi, Scarpa, Valsesia, and Verdoliva. “Deep learning methods for synthetic aperture radar image despeckling: An overview of trends and perspectives”. In: IEEE Geoscience and Remote Sensing Magazine 9.2 (2021), pp. 29-51. doi: 10.1109/mgrs.2021.3070956. url: https://ieeexplore.ieee.org/document/9416740.

[13] G. Jocher. YOLOv5 by Ultralytics. Version 7.0. 2020. doi: 10.5281/zenodo.3908559. url: https://github.com/ultralytics/yolov5.

[14] Kalepalli, Tasneem, P. D. P. Teja, and Manne. “Effective comparison of LDA with LSA for topic modelling”. In: 2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS). IEEE. 2020, pp. 1245-1250. doi: 10.1109/iciccs48265.2020.9120888. url: https://ieeexplore.ieee.org/abstract/document/9120888.

[15] Kestemont, Daelemans, and Sandra. “Robust rhymes? The stability of authorial style in medieval narratives”. In: Journal of Quantitative Linguistics 19-1 (2012), pp. 54-76. doi: 10.1080/09296174.2012.638796. url: https://www.tandfonline.com/doi/full/10.1080/09296174.2012.638796.

[16] Kevers, Gueniot, A. G. Tognotti, and S. R. Medori. “Outiller une langue peu dotée grâce au TALN: l'exemple du corse et BDLC”. In: 26e Conférence sur le Traitement Automatique des Langues Naturelles. ATALA. 2019, pp. 371-380. url: https://hal.science/hal02452276/.

[17] Kevers and S. R. Medori. “Towards a Corsican Basic Language Resource Kit”. In: 12th Language Resources and Evaluation Conference (LREC 2020). 2020. url: https://hal.science/hal-02865699/.

[18] Kiessling, Tissot, Stokes, and D. S. B. Ezra. “eScriptorium: an open source platform for historical document analysis”. In: International Conference on Document Analysis and Recognition Workshops (ICDARW). Vol. 2. IEEE. 2019, pp. 19-24. doi: 10.1109/icdarw.2019.10032. url: https://ieeexplore.ieee.org/abstract/document/8893029.

[19] Otsu. “A threshold selection method from gray-level histograms”. In: IEEE transactions on systems, man, and cybernetics 9-1 (1979), pp. 62-66. url: https://cw.fel.cvut.cz/b201/_media/courses/a6m33bio/otsu.pdf.

[20] D. Paci. “Il mito del Risorgimento mediterraneo: Corsica e Malta tra politica e cultura nel ventennio fascista”. PhD thesis. Université de Nice Sophia-Antipolis, 2013. url: https://www.theses.fr/2013NICE2012.

[21] J.-P. Pellegrinetti and A. Rovere. La Corse et la République. La vie politique, de la fin du second Empire au début du XXIe siècle. Paris, Média Diffusion, 2013, 688 p.

[22] A.-T. Pietrera. “Imaginaires nationaux et mythes fondateurs; la construction des multiples socles identitaires de la Corse française à la geste nationaliste”. PhD thesis. Université de Corse Pascal Paoli, 2015. url: https://www.theses.fr/2015CORT0008.

[23] Rogé. “Le corsisme et l'irrédentisme 1920-1946: histoire du premier mouvement autonomiste corse et de sa compromission par l'Italie fasciste”. PhD thesis. Paris 10, 2008, 1 vol. (882 p.). url: http://www.theses.fr/2008PA100048.

[24] Sarbach-Pulicani. “Authors profiling in Corsican autonomist press during the interwar period. Stylometric analysis and topic modeling on ‘A Muvra’”. MA thesis. École nationale des chartes (PSL) and Università di Pisa, 2023. doi: 10.5281/zenodo.8381161.

[25] Sarbach-Pulicani. “La presse corsiste et irrédentiste des années 1930 : étude comparative et quantitative des revues A Muvra et Corsica antica e moderna entre 1932 et 1939”. MA thesis. Université de Strasbourg, 2021.

[26] Sarbach-Pulicani. Stylometry and topic modelling in Corsican language. Version 2.0.4. 2022. url: https://github.com/vincentsarbachpulicani/Corsican-Stylometry.

[27] A. Schofield, M. Magnusson, and D. Mimno. “Pulling Out the Stops: Rethinking Stopword Removal for Topic Models”. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. Valencia, Spain: Association for Computational Linguistics, 2017, pp. 432-436. url: https://aclanthology.org/E17-2069.

[28] Starace. Bibliografia della Corsica. Centro di studi per la Corsica. Milano: Istituto per gli studi di politica internazionale, 1943.

[29] Viaut. “Marge linguistique territoriale et langues minoritaires”. In: Lengas. Revue de sociolinguistique. 71. Presses universitaires de la Méditerranée, 2012, pp. 9-28. url: https://journals.openedition.org/lengas/301.