<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>You write like a GPT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Falchi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Malvaldi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Puccetti</string-name>
        </contrib>
      </contrib-group>
      <abstract>
<p>We investigate how Raymond Queneau's Exercises in Style are evaluated by automatic methods for the detection of artificially generated text. We work with Queneau's original French version and the Italian translation by Umberto Eco. We start by comparing how various methods for the detection of automatically generated text, also using different large language models, evaluate the different styles in the work. We then link this automatic evaluation to distinct characteristics related to the content and structure of the various styles. This work is an initial attempt at exploring how methods for the detection of artificially generated text can find application as tools to evaluate the qualities and characteristics of human writing, to support better writing in terms of originality, informativeness, and clarity.</p>
      </abstract>
      <kwd-group>
        <kwd>GPT</kwd>
        <kwd>style</kwd>
        <kwd>generated text</kwd>
        <kwd>human writing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The extraordinary writing ability of the latest chatbots and virtual assistants based on Large Language Models (LLMs) poses a significant question for anyone who attempts to write today, be they a scientist, a writer, or a lover: is it worth the effort to engage in the act of writing?</p>
      <p>For those not hindered by excessive laziness and who, with courage, still tackle writing with determination and passion, this question implies a more specific one: am I writing a text that an artificial intelligence could not have produced?</p>
      <p>We believe that the answer to this question may, in the future, come from the LLMs themselves, given that they are designed to assess the probability of the occurrence of the next word in a text. We envision a future where LLMs, although widely used to produce essentially obvious texts, will assist those who still engage in writing to create texts worth reading, if only because the artificial intelligence, having read and statistically evaluated almost everything ever written, considers them non-obvious and distinct from what it would have produced itself.</p>
      <p>The ability of LLMs to evaluate the probability of the next word in a text stems from the extensive corpus of writing they are trained on. Consequently, their evaluation of a piece of writing is ultimately based on an indirect comparison between the given text and the entire body of literature they have been exposed to. Using LLMs to assess how much a text differs from the production capabilities of LLMs inherently implies an evaluation of the novelty it represents compared to known literature.</p>
      <p>Starting to move in this direction, this article explores whether an LLM can be used to help humans answer this question. In this first attempt we do this based not on the content intended for communication but on the style. We have conducted a preliminary study on the possibility of using LLMs to evaluate how and to what extent a certain writing style and/or a specific text differs from what a machine can achieve.</p>
      <p>We took as a reference Raymond Queneau's "Exercises in Style" [1], which draws from Erasmus of Rotterdam's "De Utraque Verborum ac Rerum Copia" [2], a bestseller widely used for teaching how to rewrite pre-existing texts and how to incorporate them into a new composition. In Queneau's work, the same simple story is revisited each time in a different literary style. We asked ourselves, and conducted experiments to determine, how much the texts in the various styles used by Queneau differ from the writing abilities of LLMs, which have acquired their skills by learning statistical relationships from vast amounts of text.</p>
      <p>Calvino had already attempted to answer this question: "What would be the style of a literary automaton?" He replied, "The test for a poetic-electronic machine will be the production of traditional works, of poems with closed metric forms, of novels with all the rules". We believe it has indeed happened this way, as today's chatbots and virtual assistants are built from a language model.</p>
      <p>In this work, we provide initial evidence that language models recognize texts that are more traditional, particularly those used in spoken language or by classical characters, as more probable, while they deem experimental and innovative texts more unlikely. However, we find evidence that even for powerful LLMs it remains difficult to draw a clear line between experimental texts and those that instead run the risk of becoming unreadable.</p>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy. All the authors contributed equally. ORCID: 0000-0002-5725-4322 (A. Esuli); 0000-0001-6258-5313 (F. Falchi); 0000-0003-1866-5951 (G. Puccetti). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <p>The evaluation of text readability dates back at least to the work of Flesch in 1948 [3]. Flesch's method was
based on simple surface properties of text (i.e., words per
sentence and syllables per word). Since then, methods have steadily evolved, adopting more complex NLP and ML techniques as new tools were developed (see the surveys [4, 5]).</p>
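        <p>As a concrete illustration of such surface-based scoring, here is a minimal sketch of Flesch's reading-ease formula for English; the vowel-group syllable counter is a naive simplification of our own, not the counting rule Flesch used:</p>

```python
import re

def naive_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels, at least 1 per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Flesch (1948): 206.835 - 1.015 * (words/sentence) - 84.6 * (syllables/word)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(naive_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

print(round(flesch_reading_ease("The cat sat on the mat."), 1))
```

        <p>Higher scores indicate easier text; in practice the syllable counter is the main source of error, which is why later work moved to richer NLP features.</p>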
        <p>An example of the use of LLMs on this topic is the work of Miaschi et al. [6], which investigated the correlation between a readability score measured by an automatic readability tool (READ-IT [7]) and the perplexity measured by an LLM; they found no significant correlation between the two dimensions.</p>
        <p>Hayati et al. [8] compared human and BERT-based relevance scoring of the words in a sentence to determine its style, polite or offensive, as well as the expression of sentiment and emotions. They found a loose correlation in the way words are identified as relevant by humans and by BERT, with BERT giving more relevance to context words (e.g., "baseball" for the emotion of joy), while humans are more focused on words perceived as "typical" of the style (e.g., "smile" for joy).</p>
        <p>Style transfer is the task of rewriting a passage of text, changing its lexical choices and syntactic structures without substantially changing its actual content. Krishna et al. [9] survey the style transfer literature and propose a style transfer method trained to reconstruct a style-specific text (inverse paraphrase) on pseudo-parallel data generated using a diverse paraphrase model.</p>
        <p>
          Qi et al. [
          <xref ref-type="bibr" rid="ref4">10</xref>
          ] proved that a change of writing style, made using a trained model, can be an effective means of attacking BERT-based classifiers, e.g., getting an offensive text classified as non-offensive just by rewriting it in a Bible-like style. Similarly, Krishna et al. [11] have shown that automatic paraphrasing can be extremely effective at breaking the ability of detection methods to recognize artificially generated text.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Writing with style</title>
      <p>Queneau's original work in French of 1947 [1] tells the same short story in 99 different styles. The first style, Notations, is a plain report of a sequence of events, each with details that together define the actual content of the story that is retold in all of the other 98 versions. Each version has a defining title that denotes its style. Styles can be grouped by similarity; Barbara Wright, who made the English translation in 1958 [12], reports having roughly identified seven groups 1:
• different types of speech;
• different types of written prose, e.g., Official Letter, Philosophic;
• five poetry styles, e.g., Haiku, Ode;
• eight language-based character sketches, e.g., Reactionary, Biased, Abusive;
• grammatical and rhetorical forms, e.g., Litotes, Synchesis, Parts of speech;
• jargon, e.g., mathematical, botanical;
• and the very specific group of Permutations, by groups of letters or words.</p>
      <sec id="sec-3-1">
        <p>Over time, new editions presented variations in the list of styles. For example, five styles in the original edition 2 were replaced by five others in the edition of 1969 3, the one we used in our experiments.</p>
        <p>Queneau's work has been translated into more than 30 languages. The Italian translation was made by Umberto Eco [13] in 1983. Like other translations, the Italian one retains almost all the original styles, but some are considered untranslatable and are replaced with variants that are semantically similar to the original ones, or relevant for other reasons. For example, the style Homophonique was replaced by Eco with a style named Vero? (True?), because French has many homophones while Italian has not. The Vero? style relies on the repeated use of intercalations and links to the Alors style of the French edition. Eco also decided not to translate the Loucherbem style, based on the slang spoken by Parisian and Lyonnaise butchers, considering it not interesting to link it to an Italian slang or dialect, as dialect-based styles were already included in the work. He replaced it with his own version of the Réactionnaire style from the first edition, which he liked more, as he detailed in the preface of his translation.</p>
      </sec>
      <sec id="sec-3-2">
        <p>1In the preface of the book where the groups are listed, Wright did not report a complete assignment of all styles to these groups, only hinting at a few cases for some of them. 2Réactionnaire, Feminine, Hai-Kai, Permutations de 2 à 5 lettres, Permutations de 9 à 12 lettres. 3Ensembliste, Définitionnel, Tanka, Translation, Lipogramme.</p>
      </sec>
      <sec id="sec-3-3">
        <title>4. Style and detection, is there a relation?</title>
        <p>The Research Question (RQ) we wish to answer is the following: Can we use Machine Generated Text (MGT) detection methodologies to measure some qualities and characteristics of the style used in writing a piece of text?</p>
        <p>Our assumption supporting the relevance of this RQ is that LLMs, trained on trillions of tokens, naturally approximate a writing style that is necessarily "average" and thus not original or unique. On the other hand, original and surprising writing styles, which by definition come in many very different forms, will be less frequent and sparse across the long tail of the distribution of training data, and thus modeled as less likely according to the LLMs.</p>
        <p>We use two metrics to measure the style of texts according to language models: Log Likelihood (LL) and DetectGPT [14]. These metrics are used to detect text generated by a given language model since, on average, they will be higher for text that a language model has generated than for text written by a human.</p>
        <p>We focus on Eco's Italian and Queneau's original French versions of the style exercises. To measure the scores, we use LLMs tuned for these languages: for Italian we use Anita [15], while for French we use Mistral [16].</p>
        <p>As a first validation of our assumption, Figure 1 shows the correlation between the Log Likelihood assigned to each writing style passage in Italian (y-axis) and in French (x-axis). The Figure shows a significant correlation and, zooming in on the higher Log Likelihood texts (Figure 2), we see that the correlation persists.</p>
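        <p>To make the Log Likelihood score concrete, here is a minimal sketch in which a toy character-bigram model stands in for the LLMs used in the experiments (the model, corpus, and example texts are our own illustrative assumptions, not the actual setup): the score is the average per-token log-probability, so text that looks "expected" to the model scores higher.</p>

```python
import math
from collections import Counter

def train_bigram(corpus: str):
    # Toy character-level bigram LM standing in for a large causal LLM.
    pairs = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus[:-1])
    vocab = set(corpus)
    return pairs, unigrams, vocab

def log_likelihood(text: str, model) -> float:
    # Average per-token log-probability: the LL score used for MGT detection.
    # Higher means the text looks more "expected" to the model.
    pairs, unigrams, vocab = model
    lp = 0.0
    for a, b in zip(text, text[1:]):
        # Add-one smoothing so unseen transitions get a small, nonzero probability.
        lp += math.log((pairs[(a, b)] + 1) / (unigrams[a] + len(vocab) + 1))
    return lp / max(1, len(text) - 1)

model = train_bigram("the bus was full and the man had a long neck and a funny hat " * 20)
plain = log_likelihood("the man had a hat", model)
odd = log_likelihood("nam eht dah a tah", model)  # same letters, permuted "style"
print(plain, odd)
```

        <p>The permuted string gets a lower score because its character transitions are rare in the training data, mirroring how an unusual writing style is scored as less likely by an LLM.</p>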
        <p>Similar results hold for DetectGPT: Figure 3 shows the correlation between this score for the Italian texts and for the French ones, and the correlation is close to the one for Log Likelihood shown in Figure 2.</p>
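        <p>The DetectGPT score can be sketched in the same toy setting: it compares the log-probability of a text with the average log-probability of slightly perturbed copies. Here we use random adjacent-character swaps as the perturbation, a simplification of our own standing in for the mask-and-refill perturbations of the original method, and a toy bigram model in place of the actual LLMs:</p>

```python
import math, random
from collections import Counter

def avg_log_prob(text, corpus):
    # Smoothed character-bigram log-probability, standing in for an LLM's score.
    pairs = Counter(zip(corpus, corpus[1:]))
    unis = Counter(corpus[:-1])
    v = len(set(corpus)) + 1
    lps = [math.log((pairs[(a, b)] + 1) / (unis[a] + v)) for a, b in zip(text, text[1:])]
    return sum(lps) / len(lps)

def detectgpt_score(text, corpus, n_perturb=50, rng=random.Random(0)):
    # DetectGPT-style curvature: compare the text's log-probability with that
    # of slightly perturbed variants (here, random adjacent-character swaps).
    base = avg_log_prob(text, corpus)
    perturbed = []
    for _ in range(n_perturb):
        chars = list(text)
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        perturbed.append(avg_log_prob("".join(chars), corpus))
    # A positive score means the original sits at a local probability maximum,
    # which DetectGPT takes as a hint of machine-generated text.
    return base - sum(perturbed) / n_perturb

corpus = "the bus was full and the man had a long neck and a funny hat " * 20
print(detectgpt_score("the man had a long neck", corpus))
```

        <p>In-distribution text yields a positive score because nearly every perturbation makes it less probable; highly unusual text tends to sit on flatter regions of the probability surface.</p>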
        <p>Both Figures 2 and 3 show style number 98 as a kind of outlier. This is a correct measurement, as style 98 is actually two different styles in the two versions, Loucherbem in French and Reazionario in Italian, as reported in Section 3.</p>
        <p>Both Log Likelihood and DetectGPT appear to behave consistently across languages and styles, supporting our hypothesis that some characteristics of the writing styles are captured by these scores.</p>
      </sec>
      <sec id="sec-3-4">
        <title>4.1. Analysis of Detection Scores of Styles</title>
        <p>Table 1 shows the actual values of Log Likelihood and DetectGPT for each passage in both Italian and French, as well as their ranking among all style exercises, ranked based on the DetectGPT score in Italian. We adopted Wright's grouping of styles, assigning each style to one of the seven groups listed in Section 3, and also adding an "other" group for styles for which we could not find a clear positioning in Wright's groups (typically the styles based on the almost obsessive repeated use of some kind of expression). The (colored) gr. column reports the style group assigned to each style exercise, and we can observe that ranking the styles based on the DetectGPT scores in Italian (as they are reported in the table) highlights a few prominent patterns, which we now describe.</p>
        <p>The permutation class is present only in the lower ranks, and indeed the texts belonging to this group are hard to read and do not show any recognizable stylistic pattern; they are more akin to games that make sense only within the context of Queneau's book.</p>
        <p>The texts belonging to the jargon class are also grouped together, with the exception of the "Zoologico" (Zoological), "Botanico" (Botanical) and "Medico" (Medical) ones, and are still in the lower end of the tail. Anecdotally, the three jargon styles that are in higher ranks are likely to be present in higher quantity in the LLMs' training data, justifying the ranking shift.</p>
        <p>The poetic class is the next one in average rank, just higher than the permutation one, with the exception of the "Tanka" style, which is indeed a very short text, with almost no syntax connecting minimal sentences.</p>
        <p>Interestingly, right above the poetic group stands the grammatical and rhetorical group; indeed, rhetorical figures are a key component of poem writing. This group is evenly spread among the middle ranks, with the exception of "Parti del discorso" (Parts of speech), which is in a lower position, and which is also the one with the loosest relation to the grammatical and rhetorical group.</p>
        <p>The writing group contains a large number of styles and is spread across several ranks; however, it is heavily skewed towards the higher ranks.</p>
        <p>The speech group is entirely in the higher ranks and, as its spoken source suggests, it has a strong character-rooted component.</p>
        <p>Accordingly, the only group that ranks higher than speech is character 4 which, with only two exceptions, "Ingiurioso" (Offensive) and "Impotente" (Powerless), always ranks in the top quarter, takes all three top ranks, and is the highest ranking one. The last line of Table 1 reports the ranks and scores for the Loucherbem style, which exists only in the French version. Its ranks are very low, as this style uses almost made-up words to replicate the phonetics of the jargon.</p>
        <p>The other group, which contains all those styles that are harder to assign to a specific group, is evenly spread across the lower ranks, with few exceptions, indicating that the texts that compose it are indeed quite varied and hard to group together.</p>
        <p>An overall look at the ranking, without considering the groups, suggests a relation between the scores of detection methods and some characteristics of the styles. Styles that make use of unusual or just made-up words, or that do not use a correct syntax, get low detection scores. Styles that are based on a clean, modern prose, with a simple syntax, get high detection scores. The middle ranks show a smooth transition between the two extremes, in which the use of unusual terms or syntax becomes more frequent as the detection scores get lower.</p>
        <p>4Character as in "the character of a play".</p>
      </sec>
      <sec id="sec-3-5">
        <title>5. Conclusions</title>
        <p>This work is a first exploration of the idea of designing tools that evaluate how and to what extent a writing style and/or a specific text differs from what a machine can achieve. We tested for this task the use of machine generated text detection tools, under the hypothesis of a correlation between their detection scores and our goal of discovering the many facets that build an original human-written text. We applied them to Queneau's exercises in style, in which the same story is written using a rich and varied set of writing styles. We have found a consistent correlation between the scores assigned by detection methods, across detection methods and across languages.</p>
        <p>The comparison of the styles with their detection scores indicates that lower scores from detection methods are correlated with the use of unusual terms or syntax, while higher scores are more related to styles that are based on a clean and modern prose, with a smooth transition between these two extremes. The ranks thus do not indicate a "better" or a more "interesting" style, yet they confirm Calvino's statement we reported in the introduction: content that is akin to a machine-generated one is "traditional" content, following the main rules of writing.</p>
        <p>Writers willing to depart from sounding "ordinary" could indeed use detection methods to estimate these aspects of their content, with the caveat that while a mid-level detection score may suggest some original traits in a text, low scores may not indicate a more original or interesting text; they may instead derive from an obscure or plainly unreadable text.</p>
        <p>Given the positive results of this first investigation, future developments will be based on the use of texts specifically written for this activity. This will have the advantage of having full control over the contents and the guarantee that they have never been part of the LLMs' training data.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <p>This work was partially supported by PNRR - M4C2 - Investimento 1.3, Partenariato Esteso PE00000013 - "FAIR - Future Artificial Intelligence Research" - Spoke 1 "Human-centered AI", funded by the European Union - NextGenerationEU.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] R. Queneau, Exercices de style, Gallimard, 1947.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] D. Erasmus, De Utraque Verborum ac Rerum Copia, 1512.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] R. Flesch, A new readability yardstick, Journal of Applied Psychology 32 (1948) 221.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[4] K. Collins-Thompson, Computational assessment of text readability: A survey of current and future research, ITL-International Journal of Applied Linguistics 165 (2014) 97-135.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] S. Vajjala, Trends, limitations and open challenges in automatic readability assessment research, arXiv preprint arXiv:2105.00973 (2021).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] A. Miaschi, C. Alzetta, D. Brunato, F. Dell'Orletta, G. Venturi, Is neural language model perplexity related to readability?, in: J. Monti, F. Dell'Orletta, F. Tamburini (Eds.), Proceedings of the Seventh Italian Conference on Computational Linguistics, CLiC-it 2020, Bologna, Italy, March 1-3, 2021, volume 2769 of CEUR Workshop Proceedings, CEUR-WS.org, 2020.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] F. Dell'Orletta, S. Montemagni, G. Venturi, READ-IT: assessing readability of Italian texts with a view to text simplification, in: N. Alm (Ed.), Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies, SLPAT 2011, Edinburgh, Scotland, UK, July 30, 2011, Association for Computational Linguistics, 2011, pp. 73-83.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] S. A. Hayati, D. Kang, L. Ungar, Does BERT learn as humans perceive? Understanding linguistic styles through lexica, arXiv preprint arXiv:2109.02738 (2021).</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] K. Krishna, J. Wieting, M. Iyyer, Reformulating unsupervised style transfer as paraphrase generation, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 737-762. URL: https://aclanthology.org/2020.emnlp-main.55. doi:10.18653/v1/2020.emnlp-main.55.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[10] F. Qi, Y. Chen, X. Zhang, M. Li, Z. Liu, M. Sun, Mind the style of text! Adversarial and backdoor attacks based on text style transfer, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 2021, pp. 4569-4580. URL: https://aclanthology.org/2021.emnlp-main.374. doi:10.18653/v1/2021.emnlp-main.374.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] K. Krishna, Y. Song, M. Karpinska, J. Wieting, M. Iyyer, Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 27469-27500.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] R. Queneau, B. Wright, Exercises in style, Gaberbocchus Press, 1958.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] R. Queneau, U. Eco, Esercizi di stile, Gli Struzzi, Einaudi, 1983.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, C. Finn, DetectGPT: zero-shot machine-generated text detection using probability curvature, in: Proceedings of the 40th International Conference on Machine Learning, ICML'23, JMLR.org, 2023.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the Italian language: LLaMAntino-3-ANITA, 2024. arXiv:2405.07101.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, M.-A. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. arXiv:2310.06825.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>