<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Extraction of Stylometric Information from Spanish Documents</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">César</forename><surname>Espin-Riofrio</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Guayaquil</orgName>
								<address>
									<addrLine>Delta Av. s/n</addrLine>
									<postCode>090510</postCode>
									<settlement>Guayaquil</settlement>
									<country key="EC">Ecuador</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Extraction of Stylometric Information from Spanish Documents</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">042604BAC85E685E12799375EF015343</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:05+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Style</term>
					<term>Stylometry</term>
					<term>Natural Language Processing</term>
					<term>Transformers</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The writing style of individuals is the basis for the tasks associated with stylometric analysis, such as authorship attribution, authorship verification and authorship profiling. Learning methods based on neural networks typically use only the information encoded in the last layer of a model such as a Transformer. In this paper, we describe our thesis project, in which we propose to investigate whether a deep neural network encodes style in any way. To do so, we explore the intermediate layers and the initial-token embeddings of all layers of BERT-based Transformer models, in order to identify and extract style features that improve stylistic modeling systems, with emphasis on the analysis of documents written in Spanish.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Style is defined as a form of expression or way of writing, beginning with the choice of words and extending to word combinations, punctuation, sentence structure, grammatical patterns and all the other elements an author likes to use <ref type="bibr" target="#b0">[1]</ref>. The analysis of authorial style, called stylometry, is based on the assumption that style is quantifiable, so that its distinctive qualities can be evaluated <ref type="bibr" target="#b1">[2]</ref>.</p><p>The tasks associated with stylometry, including authorship attribution, authorship verification and authorship profiling, are based on the analysis of individuals' writing style. The problem has been extensively explored, resulting in several traditional methods and tools for extracting stylometric features from a text.</p><p>Natural Language Processing (NLP) systems were initially based mainly on rules built from style features extracted from text. These were later replaced by machine learning models. Current deep learning models encode the relationships between words and learn final embeddings in their encoding layers, with encouraging results in text classification tasks; however, we do not know what information about style is contained along the encoding layers. In this sense, we are exploring what style information is captured in the embeddings throughout the encoding layers of Transformer models, in order to experiment with stylometric analysis tasks, such as authorship determination, applied mainly to the Spanish language.</p><p>In this paper, we describe our thesis project, focused on the extraction of stylometric features from Spanish-language documents. We highlight the importance of our research, review its origin and related work, state our hypothesis and describe our research together with the proposed methods, experiments and specific research elements.</p><p>Doctoral Symposium on Natural Language Processing from the Proyecto ILENIA, 28 September 2023, Jaén, Spain. cesar.espinr@ug.edu.ec (C. Espin-Riofrio), ORCID 0000-0001-8864-756X (C. Espin-Riofrio).</p></div>
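The layer-wise exploration described above can be sketched with the HuggingFace transformers API. A minimal example, assuming transformers and PyTorch are installed; it uses a small randomly initialised BERT so it runs without downloading weights (in practice a pre-trained Spanish model would be loaded instead):

```python
# Sketch: inspect the hidden states of every encoder layer, not just the last.
# The tiny config below is an assumption for illustration only.
import torch
from transformers import BertConfig, BertModel

config = BertConfig(hidden_size=64, num_hidden_layers=4,
                    num_attention_heads=4, intermediate_size=128,
                    vocab_size=1000)
model = BertModel(config)
model.eval()

input_ids = torch.randint(0, 1000, (1, 12))  # one dummy 12-token sequence
with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# hidden_states holds the embedding-layer output plus one tensor per encoder
# layer, so num_hidden_layers + 1 entries in total.
print(len(out.hidden_states))        # 5
print(out.hidden_states[0].shape)    # torch.Size([1, 12, 64])

# The vector at position 0 of each layer is the initial-token ([CLS]) encoding
# that the thesis proposes to inspect for stylistic information.
cls_per_layer = [h[0, 0] for h in out.hidden_states]
```

With a real checkpoint (e.g. a Spanish BERT), the same `output_hidden_states=True` flag exposes all 13 hidden states of a 12-layer model for analysis.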
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Justification of the proposed research</head><p>Stylometry-based Natural Language Processing uses style analysis techniques to study and characterize texts. Style is reflected in the words and expressions used, in aspects such as syntax and grammar, and in measures such as the average number of words, word usage frequency, paragraph length, etc. How a computer system can represent the style of a text or a set of documents is therefore an important question. Stylometry includes among its most important tasks authorship attribution, authorship verification and authorship profiling, most of them solved on the basis of the writing style of a text. Text classification is a fundamental NLP task in which the style of a text is the basis for feature extraction.</p><p>NLP applies machine learning methods to identify patterns and to extract and analyze features related to the writing style of a text. Traditional models obtain features using hand-crafted methods and then classify them with classical machine learning algorithms; the effectiveness of these methods is largely limited by the feature extraction step. In contrast, deep learning integrates feature engineering into model fitting by learning a set of transformations that map inputs directly to outputs. Since their emergence, deep learning models have treated style almost blindly: they learn features and relationships between words within a text without delving into style itself.</p><p>We consider it important to examine more deeply what a neural network learns about style in order to apply it to new models for text classification tasks such as authorship detection.</p><p>On the other hand, there are about 496 million people in the world who speak Spanish natively, making it the second most spoken language in the world <ref type="bibr" target="#b2">[3]</ref>. It is therefore very important to study machine learning methods for extracting style features from Spanish-language documents to solve different NLP tasks.</p></div>
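As an illustration of the surface measures mentioned above (word counts, word frequency, sentence length, vocabulary richness), a minimal sketch in plain Python; the function name and the chosen feature set are illustrative, not the thesis's actual features:

```python
# Illustrative surface-level stylometric measures (names are hypothetical).
import re
from collections import Counter

def stylometric_features(text):
    words = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    freqs = Counter(words)
    return {
        "num_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),   # words per sentence
        "type_token_ratio": len(freqs) / len(words),          # vocabulary richness
        "top_words": freqs.most_common(3),                    # most frequent words
    }

feats = stylometric_features(
    "El estilo es la forma de escribir. El estilo se puede medir.")
print(feats["num_words"], feats["avg_sentence_length"])  # 12 6.0
```

Vectors of such measures, computed per document, are what classical attribution systems feed to their classifiers.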
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Related work</head><p>The beginnings of stylometry date back to Augustus de Morgan's suggestion, in 1851, to resolve authorship disputes by means of word length frequency <ref type="bibr" target="#b3">[4]</ref>. His hypothesis was investigated by <ref type="bibr" target="#b4">[5]</ref>, who published the results of measuring the length of several hundred thousand words from the works of Bacon, Marlowe and Shakespeare. George Zipf discovered, using logarithmic scales, that there was a relationship between the rank and frequency of words, later known as Zipf's Law <ref type="bibr" target="#b5">[6]</ref>. <ref type="bibr" target="#b6">[7]</ref> measured word frequency for vocabulary richness analysis, now known as the "Yule characteristic". <ref type="bibr" target="#b7">[8]</ref> used statistical methods to investigate the authorship of the Federalist Papers; the Federalist problem has subsequently been used as stylometry's 'testing ground' for new techniques. In the late 1980s, John Burrows published a series of seminal articles in which he re-established stylometry as a viable tool in authorship attribution <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref>. The initial work involving neural networks in stylometry was presented in <ref type="bibr" target="#b11">[12]</ref>. <ref type="bibr" target="#b12">[13]</ref> achieved results consistent with those of Mosteller and Wallace described earlier, using just eleven of their thirty 'marker' words as input to a neural network.</p><p>The stylistic features of a text are present at various levels, such as the vocabulary, the syntax, the grammar, the semantics and, in some cases, the layout and presentation. <ref type="bibr" target="#b13">[14]</ref> carried out an exploration of 166 features used for authorship attribution, including commonly used stylistic features and several others intended to capture emotional tone. <ref type="bibr" target="#b14">[15]</ref> divided authorship attribution features into five groups, on which much work has been done: lexical <ref type="bibr" target="#b15">[16]</ref>, character <ref type="bibr" target="#b16">[17]</ref>, syntactic <ref type="bibr" target="#b17">[18]</ref>, semantic <ref type="bibr" target="#b14">[15]</ref> and application-specific features <ref type="bibr" target="#b18">[19]</ref>.</p><p>Simple lexical features, such as word frequencies, word n-grams, function words, and word or phrase length, have been widely used since early attribution work <ref type="bibr" target="#b4">[5]</ref>; function words were useful features in <ref type="bibr" target="#b7">[8]</ref>, and the usefulness of character n-grams was highlighted in <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b19">20]</ref>. Bag-of-words (BoW) approaches have also been reported as useful for authorship attribution <ref type="bibr" target="#b20">[21]</ref>. Term Frequency-Inverse Document Frequency (TF-IDF) <ref type="bibr" target="#b21">[22]</ref> weights word frequency by inverse document frequency to model the text.</p><p>Traditional methods are statistics-based models, such as Naïve Bayes (NB) <ref type="bibr" target="#b22">[23]</ref>, K-Nearest Neighbor (KNN) <ref type="bibr" target="#b23">[24]</ref>, and Support Vector Machine (SVM) <ref type="bibr" target="#b24">[25]</ref>. 
In the PAN 2013 competition <ref type="bibr" target="#b25">[26]</ref>, all participants used a machine learning algorithm for classification, including Decision Trees, Support Vector Machines and Random Forests <ref type="bibr" target="#b26">[27,</ref><ref type="bibr" target="#b27">28]</ref>.</p><p>The arrival of better computer hardware such as GPUs, and of word embeddings such as Word2Vec <ref type="bibr" target="#b28">[29]</ref> and GloVe <ref type="bibr" target="#b29">[30]</ref>, increased the use of deep learning models such as CNNs <ref type="bibr" target="#b30">[31]</ref> and RNNs <ref type="bibr" target="#b31">[32]</ref>. LSTM (Long Short-Term Memory) <ref type="bibr" target="#b32">[33]</ref> attempts to solve the short-term memory problem of RNNs by retaining selected information in long-term memory. Convolutional seq2seq <ref type="bibr" target="#b33">[34]</ref> applies convolutional neural networks to sequence-to-sequence tasks.</p><p>Transformers <ref type="bibr" target="#b34">[35]</ref> apply self-attention, which captures the weight distribution of words in sentences. The attention mechanism is often used in an encoder-decoder architecture, and there are many variants of attention implementations <ref type="bibr" target="#b35">[36]</ref>. A Transformer encoder layer is composed of multi-head self-attention followed by a position-wise feed-forward network (FFN), with residual connections <ref type="bibr" target="#b36">[37]</ref> and layer normalization <ref type="bibr" target="#b37">[38]</ref>. Transformer architectures rely on explicit position encodings to preserve a notion of word order; a positional embedding should be considered together with the NLP task <ref type="bibr" target="#b38">[39]</ref>. The absolute position embedding is used to model how a token at one position attends to a token at a different position <ref type="bibr" target="#b39">[40]</ref>. Pre-trained language models <ref type="bibr" target="#b40">[41]</ref> became a trend among many NLP tasks. 
Pre-trained language models effectively learn global semantic representations and significantly boost NLP tasks, including text classification. They generally use unsupervised methods to mine semantic knowledge automatically and then construct pre-training targets so that machines can learn to understand semantics <ref type="bibr" target="#b41">[42]</ref>.</p><p>Transformer-based pre-trained language models (T-PTLM) learn universal language representations from large volumes of text data using self-supervised learning and transfer this knowledge to downstream tasks. These models provide good background knowledge to downstream tasks, which avoids training downstream models from scratch <ref type="bibr" target="#b42">[43]</ref>. GPT <ref type="bibr" target="#b43">[44]</ref> and BERT <ref type="bibr" target="#b44">[45]</ref> are the first Transformer-based pre-trained language models, developed from Transformer decoder and encoder layers respectively.</p><p>In general, an encoder-based T-PTLM consists of an embedding layer followed by a stack of encoder layers. For example, the BERT-base model consists of 12 encoder layers, while the BERT-large model consists of 24 encoder layers. The output of the last encoder layer is treated as the final contextual representation of the input sequence. Encoder-based models like BERT are generally used in Natural Language Understanding (NLU) tasks.</p><p>Transformer-based models can parallelize computation without considering sequential information, which suits large-scale datasets and makes them popular for NLP tasks. 
Thus, several other models achieve excellent performance in text classification tasks, such as RoBERTa <ref type="bibr" target="#b45">[46]</ref>, XLNet <ref type="bibr" target="#b46">[47]</ref>, BART <ref type="bibr" target="#b47">[48]</ref>, DeBERTa <ref type="bibr" target="#b48">[49]</ref> and ERNIE <ref type="bibr" target="#b49">[50]</ref>.</p><p>In NLP tasks related to stylometry, several lexicons have been employed, such as EuroWordNet <ref type="bibr" target="#b50">[51]</ref> and the Spanish Emotion Lexicon (SEL) <ref type="bibr" target="#b51">[52]</ref>; <ref type="bibr" target="#b52">[53]</ref> perform a lexicon-based sentiment analysis of short Spanish texts generated on the social network Twitter, and <ref type="bibr" target="#b53">[54]</ref> present a lexicon-based approach to extract sentiment from text; other resources include the Bing Liu English lexicon for polarity classification <ref type="bibr" target="#b54">[55]</ref> and the Spanish Opinion Lexicon (SOL) <ref type="bibr" target="#b55">[56]</ref>.</p><p>Regarding corpora, several publicly available corpora for stylometrics are important for NLP-related research: the AuTexTification dataset <ref type="bibr" target="#b56">[57]</ref>, Enron <ref type="bibr" target="#b57">[58]</ref>, IMDB1M reviews <ref type="bibr" target="#b58">[59]</ref> and the Guardian10 corpus <ref type="bibr" target="#b59">[60]</ref>. There are also several important corpora for specific tasks in Spanish, such as OffendES, a corpus for researching offensive language <ref type="bibr" target="#b60">[61]</ref>, the SFU Spanish review corpus <ref type="bibr" target="#b61">[62]</ref>, PoliCorpus 2020 <ref type="bibr" target="#b62">[63]</ref> and eSOLHotel. 
Datasets have also been released for shared tasks such as Multi-Author Writing Style Analysis at PAN@CLEF 2023 <ref type="bibr" target="#b63">[64]</ref>, PAN22 Style Change Detection <ref type="bibr" target="#b64">[65]</ref>, PAN21 Profiling Hate Speech Spreaders on Twitter <ref type="bibr" target="#b65">[66]</ref> and PAN20 Profiling Fake News Spreaders on Twitter <ref type="bibr" target="#b66">[67]</ref>.</p><p>Regarding the main tasks related to stylometry, there are shared evaluation campaigns for Natural Language Processing (NLP) systems in Spanish and other languages, such as automatically generated text identification (Human or Generated) and model attribution in AuTexTification at IberLEF 2023; Spanish Author Profiling for Political Ideology at IberLEF 2022; and the PAN CLEF shared tasks: Multi-Author Writing Style Analysis (PAN23), Style Change Detection (PAN22), Profiling Hate Speech Spreaders on Twitter (PAN21), Profiling Fake News Spreaders on Twitter (PAN20), Celebrity Profiling (PAN20) and Bots and Gender Profiling (PAN19), among others.</p></div>
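Several of the classic lexical features surveyed in this section, character n-grams and TF-IDF weighting in particular, can be computed in a few lines. A sketch using the standard TF-IDF formulation (term frequency times log inverse document frequency); the two toy documents are invented:

```python
# Classic lexical features: character n-grams and plain TF-IDF weights.
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams, a language-independent style feature."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tf_idf(docs):
    """TF-IDF over tokenized documents: tf(t, d) * log(N / df(t))."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # document frequency per term
    n_docs = len(docs)
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [char_ngrams("the style"), char_ngrams("the story")]
w = tf_idf(docs)
print(w[0]["the"])   # 0.0 -- trigram shared by both documents, so idf is zero
```

Terms common to every document get zero weight, which is precisely why TF-IDF highlights the vocabulary that distinguishes one author or document from another.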
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Research proposal and hypothesis</head><p>Our hypothesis is that neural networks are capable of capturing stylistic information, and that this information, combined with previously known stylistic features such as character-level, word and phrase characteristics, vocabulary richness, lexical complexity, etc., can help solve tasks such as authorship attribution, profiling users based on their writing, and differentiating between synthetic and human-written text.</p><p>The question arises: what does a neural network learn that is related to style?</p><p>We propose to investigate this topic further and determine what information about style is contained throughout the layers of pre-trained Transformer-based models, and to experiment with methods for extracting their embeddings to refine learning models in text classification tasks, especially in Spanish.</p></div>
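One operational reading of the hypothesis above is to concatenate a network-derived embedding with hand-crafted stylometric measures into a single feature vector for a downstream classifier. A minimal numpy sketch, with all values and dimensions hypothetical:

```python
# Combining a layer embedding with hand-crafted style features (all values
# below are stand-ins; a real pipeline would supply both from actual models).
import numpy as np

rng = np.random.default_rng(0)

layer_embedding = rng.normal(size=768)       # stand-in for a BERT layer vector
handcrafted = np.array([4.2, 17.5, 0.61])    # e.g. avg word len, sent len, TTR

# Standardize the hand-crafted block so neither view dominates by scale,
# then concatenate into one representation.
handcrafted = (handcrafted - handcrafted.mean()) / handcrafted.std()
combined = np.concatenate([layer_embedding, handcrafted])
print(combined.shape)   # (771,)
```

The combined vector can then be fed to any classifier (SVM, logistic regression, a fine-tuning head) for tasks such as authorship attribution.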
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Methodology and proposed experiments</head><p>An exhaustive analysis of the state of the art has been carried out, determining the classical techniques for extracting style features from Spanish and English texts and experimenting with them in different application domains and tasks.</p><p>We have participated in the main international forums on NLP tasks, such as PAN, IberLEF and SemEval. In our experiments we use the reference datasets proposed in those campaigns, which allows us to compare our results with those obtained by other researchers.</p><p>We are experimenting with current neural network models to determine what they learn about style. To this end, we are currently exploring the extraction of initial-token embeddings from all layers of BERT-based Transformer models to fine-tune a learning model for various text classification tasks.</p><p>In terms of dissemination, we have published several scientific papers with international reach, and we are also participating in international scientific conferences such as SePLN, Laccei and SmartTech.</p></div>
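The layer-wise extraction step can be illustrated with numpy alone; here the 13 hidden states of a BERT-base-like model are stand-in random arrays, and only the initial-token vector of each layer is kept before pooling (both pooling choices shown are illustrative, not the thesis's chosen method):

```python
# Pooling the initial-token ([CLS]) vector across all layers (shapes assumed:
# 13 hidden states of dimension 768, i.e. embeddings + 12 encoder layers).
import numpy as np

rng = np.random.default_rng(1)
hidden_states = [rng.normal(size=(1, 20, 768)) for _ in range(13)]

# Keep only position 0 (the initial token) of each layer.
cls_vectors = np.stack([h[0, 0] for h in hidden_states])   # (13, 768)

# Two simple choices for a layer-wise style representation:
mean_pooled = cls_vectors.mean(axis=0)     # average across layers: (768,)
concatenated = cls_vectors.reshape(-1)     # stack all layers: (9984,)
print(mean_pooled.shape, concatenated.shape)
```

Either representation can replace the usual last-layer-only vector when fine-tuning a classification head, which is the comparison the proposed experiments set up.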
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Specific research elements proposed</head><p>We explore the extent to which linguistic features of various kinds that can be extracted from text, such as lexical diversity, lexical complexity, and syntactic and semantic features, can be considered elements of style.</p><p>We ask whether there are style features in the parameters that a neural network learns, whether style is encoded in any way in a deep neural network such as a Transformer-based model, and if so, where and how. It is possible that the deeper layers of a Transformer encoder contain information about style rather than semantics. In this sense, building on a series of prior works, we are analyzing not only the final encoding of BERT-based Transformer models but also the first and intermediate encoding layers in search of style features, and we are exploring ways to analyze and extract that information to improve stylistic modeling systems.</p></div>		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>I thank the University of Jaén for allowing me to carry out my doctoral studies there, as well as my director, PhD Arturo Montejo Ráez, my tutor, PhD Fernando Martínez Santiago, and the program coordinator, PhD Luis Alfonso Ureña López.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">The sense of style: The thinking person&apos;s guide to writing in the 21st century</title>
		<author>
			<persName><forename type="first">S</forename><surname>Pinker</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<publisher>Penguin Books</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Surveying stylometry techniques and applications</title>
		<author>
			<persName><forename type="first">T</forename><surname>Neal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Sundararajan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fatima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Woodard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys (CSuR)</title>
		<imprint>
			<biblScope unit="volume">50</biblScope>
			<biblScope unit="page" from="1" to="36" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><surname>Cvc</surname></persName>
		</author>
		<author>
			<persName><surname>Anuario</surname></persName>
		</author>
		<ptr target="https://cvc.cervantes.es/lengua/anuario/anuario_22/informes_ic/p01.htm" />
		<title level="m">El español: una lengua viva. El español en cifras</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">Informe</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>De Morgan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">De</forename><surname>Morgan</surname></persName>
		</author>
		<title level="m">Memoir of Augustus De Morgan</title>
				<imprint>
			<publisher>Longmans, Green, and Company</publisher>
			<date type="published" when="1882">1882</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The characteristic curves of composition</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">C</forename><surname>Mendenhall</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="page" from="237" to="246" />
			<date type="published" when="1887">1887</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Selected studies of the principle of relative frequency in language</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">K</forename><surname>Zipf</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1932">1932</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">The statistical study of literary vocabulary</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">U</forename><surname>Yule</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Mathematical Proceedings of the Cambridge Philosophical Society</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="page" from="1" to="2" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Inference and disputed authorship</title>
		<author>
			<persName><forename type="first">F</forename><surname>Mosteller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">L</forename><surname>Wallace</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Federalist</title>
		<imprint>
			<date type="published" when="1964">1964</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Word-patterns and story-shapes: The statistical analysis of narrative style</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Burrows</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Literary &amp; Linguistic Computing</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="61" to="70" />
			<date type="published" when="1987">1987</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Statistical analysis and some major determinants of literary style</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Burrows</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers and the Humanities</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="309" to="321" />
			<date type="published" when="1989">1989</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Not unless you ask nicely: The interpretative nexus between analysis and information</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Burrows</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Literary and Linguistic Computing</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="91" to="109" />
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Neural computation in stylometry I: An application to the works of Shakespeare and Fletcher</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Matthews</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">V</forename><surname>Merriam</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Literary and Linguistic computing</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="203" to="209" />
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Neural network applications in stylometry: The federalist papers</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">J</forename><surname>Tweedie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">I</forename><surname>Holmes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers and the Humanities</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Guthrie</surname></persName>
		</author>
		<title level="m">Unsupervised detection of anomalous text</title>
				<imprint>
			<publisher>Citeseer</publisher>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A survey of modern authorship attribution methods</title>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Society for information Science and Technology</title>
		<imprint>
			<biblScope unit="volume">60</biblScope>
			<biblScope unit="page" from="538" to="556" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">N-gram feature selection for authorship identification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Houvardas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Artificial Intelligence: Methodology, Systems, and Applications: 12th International Conference, AIMSA 2006</title>
				<meeting><address><addrLine>Varna, Bulgaria</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2006">September 12-15, 2006. 2006</date>
			<biblScope unit="page" from="77" to="86" />
		</imprint>
	</monogr>
	<note>Proceedings 12</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Language independent authorship attribution using character level language models</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">P D S V</forename><surname>Keselj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">A relational unsupervised approach to author identification</title>
		<author>
			<persName><forename type="first">F</forename><surname>Leuzzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ferilli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rotella</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">New Frontiers in Mining Complex Patterns: Second International Workshop, NFMCP 2013, Held in Conjunction with ECML-PKDD 2013</title>
		<title level="s">Revised Selected Papers</title>
		<meeting><address><addrLine>Prague, Czech Republic</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013-09-27">September 27, 2013. 2014</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="214" to="228" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">A framework for authorship identification of online messages: Writing-style features and classification techniques</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American society for information science and technology</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="page" from="378" to="393" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Authorship attribution of micro-messages</title>
		<author>
			<persName><forename type="first">R</forename><surname>Schwartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Tsur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rappoport</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Koppel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2013 Conference on empirical methods in natural language processing</title>
				<meeting>the 2013 Conference on empirical methods in natural language processing</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1880" to="1891" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Authorship attribution in the wild</title>
		<author>
			<persName><forename type="first">M</forename><surname>Koppel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Argamon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language Resources and Evaluation</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page" from="83" to="94" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Baeza-Yates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ribeiro-Neto</surname></persName>
		</author>
		<title level="m">Modern information retrieval</title>
				<meeting><address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>ACM press</publisher>
			<date type="published" when="1999">1999</date>
			<biblScope unit="volume">463</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Automatic indexing: an experimental inquiry</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Maron</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the ACM (JACM)</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="404" to="417" />
			<date type="published" when="1961">1961</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Nearest neighbor pattern classification</title>
		<author>
			<persName><forename type="first">T</forename><surname>Cover</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hart</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Information Theory</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="21" to="27" />
			<date type="published" when="1967">1967</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Text categorization with support vector machines: Learning with many relevant features</title>
		<author>
			<persName><forename type="first">T</forename><surname>Joachims</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European conference on machine learning</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="137" to="142" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Overview of the author profiling task at PAN 2013</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Koppel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Inches</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF conference on multilingual and multimodal information access evaluation</title>
				<imprint>
			<publisher>CELCT</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="352" to="365" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Artificial neural networks</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M</forename><surname>Mitchell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page">127</biblScope>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Random forests</title>
		<author>
			<persName><forename type="first">L</forename><surname>Breiman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page" from="5" to="32" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1301.3781</idno>
		<title level="m">Efficient estimation of word representations in vector space</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">GloVe: Global vectors for word representation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</title>
				<meeting>the 2014 conference on empirical methods in natural language processing (EMNLP)</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1532" to="1543" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Kalchbrenner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grefenstette</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Blunsom</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1404.2188</idno>
		<title level="m">A convolutional neural network for modelling sentences</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1605.05101</idno>
		<title level="m">Recurrent neural network for text classification with multi-task learning</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Long short-term memory</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Computation</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="1735" to="1780" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Convolutional sequence to sequence learning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Gehring</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Grangier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yarats</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">N</forename><surname>Dauphin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1243" to="1252" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Bahdanau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1409.0473</idno>
		<title level="m">Neural machine translation by jointly learning to align and translate</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="770" to="778" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Ba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Kiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1607.06450</idno>
		<title level="m">Layer normalization</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b38">
	<monogr>
		<author>
			<persName><forename type="first">Y.-A</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-N</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.04903</idno>
		<title level="m">What do position embeddings learn? An empirical study of pre-trained language model positional encoding</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b39">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xiang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2009.13658</idno>
		<title level="m">Improve transformer models with better relative position embeddings</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Pre-trained models for natural language processing: A survey</title>
		<author>
			<persName><forename type="first">X</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science China Technological Sciences</title>
		<imprint>
			<biblScope unit="volume">63</biblScope>
			<biblScope unit="page" from="1872" to="1897" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">A survey on text classification: From traditional to deep learning</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>He</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Intelligent Systems and Technology (TIST)</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="1" to="41" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">S</forename><surname>Kalyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rajasekharan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sangeetha</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2108.05542</idno>
		<title level="m">AMMUS: A survey of transformer-based pretrained models in natural language processing</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b43">
	<monogr>
		<title level="m" type="main">Improving language understanding by generative pre-training</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Narasimhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Salimans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b45">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.11692</idno>
		<title level="m">RoBERTa: A robustly optimized BERT pretraining approach</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b46">
	<analytic>
		<title level="a" type="main">XLNet: Generalized autoregressive pretraining for language understanding</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carbonell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b47">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ghazvininejad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1910.13461</idno>
		<title level="m">BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b48">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.03654</idno>
		<title level="m">DeBERTa: Decoding-enhanced BERT with disentangled attention</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b49">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.09223</idno>
		<title level="m">ERNIE: Enhanced representation through knowledge integration</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b50">
	<monogr>
		<title level="m" type="main">A multilingual database with lexical semantic networks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Vossen</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>Kluwer Academic Publishers</publisher>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="978" to="994" />
			<pubPlace>Dordrecht</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b51">
	<analytic>
		<title level="a" type="main">Empirical study of machine learning based approach for opinion mining in tweets</title>
		<author>
			<persName><forename type="first">G</forename><surname>Sidorov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Miranda-Jiménez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Viveros-Jiménez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gelbukh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Castro-Sánchez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Velásquez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Díaz-Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Suárez-Guerra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Trevino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gordon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Artificial Intelligence: 11th Mexican International Conference on Artificial Intelligence, MICAI 2012</title>
				<meeting><address><addrLine>San Luis Potosí, Mexico</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012-11-04">October 27-November 4, 2012; published 2013</date>
			<biblScope unit="page" from="1" to="14" />
		</imprint>
	</monogr>
	<note>Revised Selected Papers, Part I 11</note>
</biblStruct>

<biblStruct xml:id="b52">
	<analytic>
		<title level="a" type="main">Lexicon-based sentiment analysis of Twitter messages in Spanish</title>
		<author>
			<persName><forename type="first">A</forename><surname>Moreno-Ortiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">P</forename><surname>Hernández</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Procesamiento del Lenguaje Natural</title>
		<imprint>
			<biblScope unit="volume">50</biblScope>
			<biblScope unit="page" from="93" to="100" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b53">
	<analytic>
		<title level="a" type="main">Lexicon-based methods for sentiment analysis</title>
		<author>
			<persName><forename type="first">M</forename><surname>Taboada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Brooke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tofiloski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Voll</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stede</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="267" to="307" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b54">
	<analytic>
		<title level="a" type="main">Mining and summarizing customer reviews</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining</title>
				<meeting>the tenth ACM SIGKDD international conference on Knowledge discovery and data mining</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="168" to="177" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b55">
	<analytic>
		<title level="a" type="main">Semantic orientation for polarity classification in Spanish reviews</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Molina-González</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Martínez-Cámara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-T</forename><surname>Martín-Valdivia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Perea-Ortega</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert Systems with Applications</title>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="page" from="7250" to="7257" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b56">
	<monogr>
		<title level="m" type="main">AuTexTification dataset (full data)</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sarvazyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ángel González</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Franco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Chulvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.7956207</idno>
		<ptr target="https://doi.org/10.5281/zenodo.7956207" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b57">
	<analytic>
		<title level="a" type="main">The Enron corpus: A new dataset for email classification research</title>
		<author>
			<persName><forename type="first">B</forename><surname>Klimt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning: ECML 2004: 15th European Conference on Machine Learning</title>
				<meeting><address><addrLine>Pisa, Italy</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2004">September 20-24, 2004</date>
			<biblScope unit="page" from="217" to="226" />
		</imprint>
	</monogr>
	<note>Proceedings 15</note>
</biblStruct>

<biblStruct xml:id="b58">
	<analytic>
		<title level="a" type="main">Personalised rating prediction for new users using latent factor models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Seroussi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bohnert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Zukerman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd ACM conference on Hypertext and hypermedia</title>
				<meeting>the 22nd ACM conference on Hypertext and hypermedia</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="47" to="56" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b59">
	<analytic>
		<title level="a" type="main">On the robustness of authorship attribution based on character n-gram features</title>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">JL &amp; Pol&apos;y</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page">421</biblScope>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b60">
	<analytic>
		<title level="a" type="main">OffendES: A new corpus in Spanish for offensive language research</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Plaza-Del Arco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Montejo-Ráez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Urena-López</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Martín-Valdivia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Recent Advances in Natural Language Processing</title>
				<meeting>the International Conference on Recent Advances in Natural Language Processing</meeting>
		<imprint>
			<publisher>RANLP</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1096" to="1108" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b61">
	<monogr>
		<title level="m" type="main">SFU Review Corpus</title>
		<author>
			<persName><forename type="first">M</forename><surname>Taboada</surname></persName>
		</author>
		<ptr target="https://www.sfu.ca/~mtaboada/SFU_Review_Corpus.html" />
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b62">
	<analytic>
		<title level="a" type="main">Psychographic traits identification based on political ideology: An author analysis study on Spanish politicians&apos; tweets posted in 2020</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>García-Díaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Colomo-Palacios</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Valencia-García</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Future Generation Computer Systems</title>
		<imprint>
			<biblScope unit="volume">130</biblScope>
			<biblScope unit="page" from="59" to="74" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b63">
	<monogr>
		<title level="m" type="main">PAN23 multi-author writing style analysis</title>
		<author>
			<persName><forename type="first">E</forename><surname>Zangerle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mayerl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.7729178</idno>
		<ptr target="https://doi.org/10.5281/zenodo.7729178" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b64">
	<monogr>
		<title level="m" type="main">PAN22 authorship analysis: Style change detection</title>
		<author>
			<persName><forename type="first">E</forename><surname>Zangerle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mayerl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tschuggnall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.6334245</idno>
		<ptr target="https://doi.org/10.5281/zenodo.6334245" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b65">
	<monogr>
		<title level="m" type="main">Profiling hate speech spreaders on Twitter</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chulvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">L D L</forename><surname>Peña</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fersini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.4603578</idno>
		<ptr target="https://doi.org/10.5281/zenodo.4603578" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b66">
	<monogr>
		<title level="m" type="main">Profiling fake news spreaders on Twitter</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ghanem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Giachanou</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.4039435</idno>
		<ptr target="https://doi.org/10.5281/zenodo.4039435" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
