
Unmasking the Wordsmith: Revealing Author Identity through Reader Reviews

Chiara Alzetta

Felice Dell'Orletta

Chiara Fazzone

Alessio Miaschi

Giulia Venturi

ItaliaNLP Lab, CNR, Istituto di Linguistica Computazionale 'A. Zampolli', Pisa, Italy

Traditional genre-based approaches for book recommendations face challenges due to the vague definition of genres. To overcome this, we propose a novel task called Book Author Prediction, where we predict the author of a book based on user-generated reviews' writing style. To this aim, we first introduce the 'Literary Voices Corpus' (LVC), a dataset of Italian book reviews, and use it to train and test machine learning models. Our study contributes valuable insights for developing user-centric systems that recommend leisure readings based on individual readers' interests and writing styles.

Keywords: Book Author Prediction, Italian reviews, stylistic analysis, user-generated book reviews

1. Introduction and Background

Reading for pleasure is currently experiencing a significant decline, as evidenced by surveys indicating that leisure reading has reached an unprecedented low¹. Book recommender systems have been proposed as a valuable tool to promote the practice of reading for pleasure [1]. These systems provide personalized suggestions and aid users in navigating the vast array of available literary works [2]. Their integration into e-commerce services has long been explored, as it benefits both sellers and consumers [3].

Typically integrated with online platforms, book recommender systems rely on the history of users to predict their future interests and provide recommendations based on the literary genre or authors that users have previously engaged with. While recommending the other books by an author that the reader enjoyed is trivial, suggesting books belonging to the same genre remains a complex area of study, particularly concerning literary novels [4]. This is mostly due to the fact that the notion of genre represents a quite heterogeneous object of study due to multiple factors [5]. In fact, the same book can be assigned to more than one literary genre either on the same reading platform or across diverse platforms. Accordingly, various approaches have been proposed to automatically identify literary genres using book content [6, 7, 8], titles or summaries [9], and even cover designs [10]. Nevertheless, these models often face challenges when book content is inaccessible due to licensing restrictions.

Consequently, an alternative and promising line of research on book recommender systems involves leveraging user reviews as a valuable source of information for generating recommendations. Analyzing reviews allows for a unique perspective on books from the viewpoint of their readers, without requiring access to their content. Reviews offer valuable insights into readers' opinions and preferences, and they have been effectively utilized to predict trends in the book market [11, 12, 13, 14, 15].

There are few attempts to exploit user reviews also for literary genre identification. These include [16] and [17] for English and Portuguese book reviews respectively. We have also contributed to this line of research by focusing on Italian book reviews [18]. In our previous work, we demonstrated how book reviews published by amateur readers on two social reading platforms, namely Amazon and Goodreads, can be exploited to automatically identify the genre of the reviewed book.

Building upon our prior investigations, our current research aims to explore whether the writing style of user-generated reviews, analyzed in terms of lexical and (morpho-)syntactic characteristics, can serve as a reliable source of information also to predict the author of a reviewed book. We started from the assumption that the vague definition of literary genres might make recommendations based on related authors more effective than genre-based approaches.

To this end, inspired by the literature on Authorship Attribution [19], we introduced a novel task named Book Author Prediction. We tackle the problem as a supervised classification task, where the objective is to predict the author of a given book from a set of potential candidates. It is important to note that, unlike the traditional Authorship Attribution task, our information source consists of user-generated reviews rather than the books authored by the novelists themselves. This distinction adds a layer of complexity to the task, making it particularly challenging and novel in its approach. As a crucial step towards this objective, we introduce a novel dataset of Amazon² and Goodreads³ book reviews, the 'Literary Voices Corpus' (LVC). The dataset successfully served in diverse experimental settings we explored in this work aimed at training and testing pretrained and traditional machine learning models, that use different configurations of lexical and (morpho-)syntactic features, to accomplish the new prediction task.

The work presented in this study falls within the context of collective efforts to foster the habit of reading and enlarge the readership across different target audiences⁴. Among these initiatives, LettERE (Letture pER TE) is a project that aims to encourage and promote the practice of reading by creating a reading recommendation system that provides personalised recommendations tailored to the reader's language skills and interests (see Acknowledgements). In this regard, the research presented in this paper contributes significantly to the LettERE project's objectives by showing that user-generated reviews can be effectively used to identify readers sharing common interests and ultimately provide personalised book recommendations.

The remainder of the paper is organised as follows. Section 2 presents LVC, the novel collection of Italian book reviews referring to the books of six popular authors. Section 3 introduces the Book Author Prediction task and details the methodology and models exploited in this work to address it. Section 4 presents the results of our experiments. Finally, Section 5 offers conclusions and outlines potential future research directions.

CLiC-it 2023: 9th Italian Conference on Computational Linguistics, Nov 30 - Dec 02, 2023, Venice, Italy

chiara.alzetta@ilc.cnr.it (C. Alzetta); felice.dellorletta@ilc.cnr.it (F. Dell'Orletta); chiara.fazzone@ilc.cnr.it (C. Fazzone); alessio.miaschi@ilc.cnr.it (A. Miaschi); giulia.venturi@ilc.cnr.it (G. Venturi)

© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

¹ See https://www.istat.it/it/archivio/284591, https://literacytrust.org.uk/research-services/annual-literacy-survey/

2. The Literary Voices Corpus

We performed our experiments on the 'Literary Voices Corpus' (LVC), which encompasses a collection of book reviews in Italian published on two leading platforms for Digital Social Reading (DSR), Amazon Books and Goodreads, and covering the work of several authors of fiction novels.⁵ This corpus is a spin-off of the 'A Good Review' corpus, which we introduced in [18]. The LVC corpus is aimed at being representative of two different approaches to writing book reviews, a diversity specific to the peculiarities of the two platforms. In fact, while Goodreads gathers a large community of amateur readers to exchange opinions and reading recommendations, Amazon has a marked commercial vocation and treats books mainly as a consumer good. Goodreads reviews are typically exploited to predict the orientation of the book market [11, 13], to map reading preferences across various communities of users [20], as well as to analyze the linguistic style adopted by readers to describe their reading experiences [21, 22]. Conversely, reviews posted on Amazon Books have mostly been investigated within marketing and buyers' behaviour studies, often relying on sentiment analysis [23, 24, 25].

When building LVC, we first chose popular novelists in order to acquire a diverse but rich collection of reviews from amateur readers. These are J.K. Rowling, Stephen King, J.R.R. Tolkien, Jane Austen, Sarah J. Maas, and Dan Brown.⁶

Since literary genre is not a monolithic notion [4], the books of these authors traverse multiple genres. For example, King's repertoire encompasses horror, thriller, and science-fiction, while Maas's fantasy novels also incorporate a substantial element of romance. Then, we extracted the reviews for their respective books from the 'A Good Review' corpus and we integrated the set with new books if necessary using the ISBN number of a book to unambiguously identify it on Amazon and Goodreads and to collect its reviews written in Italian. This was done to reach a minimum of 1,100 reviews per novelist from Goodreads and 800 reviews from Amazon. While we successfully obtained the desired number of reviews for most authors, we encountered challenges for Austen and Maas on Amazon. Nonetheless, the number of reviews collected for these authors can still be considered reasonably comparable to the desired amount. The statistics of the final LVC dataset are reported in Table 1.

As can be noted, the two portions of the dataset (i.e., Amazon and Goodreads) are quite different in terms of the length of a single review. This difference arises in part from the lower number of reviews collected from Amazon, but mostly from the comparatively greater length of Goodreads reviews in terms of sentences and tokens. Thus, achieving a balanced number of reviews across authors does not correspond to an equal number of tokens. Furthermore, we notice a tendency to produce longer reviews among the readers of certain authors, such as King, Maas, or Austen, on both platforms. This represents one of the first general characterizations of the diversity across literary voices we collected.

² https://www.amazon.it

³ https://www.goodreads.com

⁴ See for instance: https://www.regione.toscana.it/-/un-patto-per-la-lettura.

⁵ The LVC corpus is freely available upon request for research purposes.

⁶ The complete list of books whose reviews in Italian have been included in LVC can be found in Appendix A.

Table 1 (statistics of LVC; only two rows survive and their row and column labels are not recoverable from the source):

7    1,100    6,224    180,680    5.65    164.25
6    800      2,695    48,275     3.36    60.34

3. Book Author Prediction

The novel task of Book Author Prediction consists of predicting the author of a book from the readers' reviews. We explored the performance on the task of a suite of machine learning algorithms that vary with respect to the architecture and features used for training (see Section 3.1). The models leverage a wide spectrum of text properties acquired from the reviews of increasing informativeness, which range from n-grams of words to stylistic features (Section 3.2), up to contextual sentence representations of Neural Language Models. For all models, we adopted a 5-fold cross-validation approach for training and testing. The train and test sets always contain reviews of different books, thus increasing the complexity of the classification tasks. Note that, considering the high discriminative power of proper nouns in this classification scenario, we performed the linguistic analysis of reviews and sanitized the text [26] by masking all tokens marked as proper nouns (POS = PROPN).

3.1. Models

Linear Support Vector Machine. We define two LinearSVM models, referred to as 'Profiling' and 'Ngrams' models. The former takes as input the set of linguistic characteristics described in Sec. 3.2. The Ngrams model exploits lexical information, since its input features are contiguous sequences of n words acquired from the reviews (i.e. n-grams, with n equal to 1, 2, and 3).

Neural Language Model. We relied on the Italian pretrained version of the BERT model (12 layers, 768 hidden units) [27]⁷, which was pretrained using the Italian Wikipedia and the Italian portion of the OPUS corpus [28], a multilingual collection of translated open source documents available on the Internet. We then fine-tuned it on the Book Author Prediction task.

LinearSVM + NLM. We combined the previous models into a classifier based on LinearSVM and trained using the internal representations of the BERT model fine-tuned on the author classification task. We refer to this model as SVM (BERT). SVM (BERT+Profiling) is an additional LinearSVM model trained using both the fine-tuned representations produced by BERT and Profiling-UD features. The BERT representations used as input features of the SVM model were computed by averaging the embeddings of all the tokens in each review.

Baselines. We compared the performance of the above models against a random uniform classifier, i.e. a model that uniformly generates random predictions for each author.

Table 3 (classification accuracies; the model rows are listed once per platform in the source, and the labels of the four value columns are not recoverable):

Model                    (1)    (2)    (3)    (4)
Baseline                 0.16   0.14   0.16   0.16
Profiling                0.25   0.22   0.26   0.26
Ngrams                   0.44   0.39   0.44   0.42
BERT                     0.74   0.61   0.73   0.61
SVM (BERT)               0.56   0.43   0.54   0.46
SVM (BERT+Profiling)     0.57   0.36   0.52   0.43
Average                  0.51   0.40   0.50   0.44

Table 2: Linguistic features acquired from book reviews.
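The preprocessing and evaluation steps described in Section 3 (masking proper nouns, extracting word n-grams with n from 1 to 3, keeping the books in training and test folds disjoint, and averaging token embeddings per review) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `[PROPN]` placeholder, the function names, and the round-robin assignment of books to folds are hypothetical choices, and the sketch does not check that every author is represented in each training split.

```python
import random
from collections import defaultdict

MASK = "[PROPN]"  # hypothetical placeholder; the paper does not name the mask symbol


def mask_proper_nouns(tagged_tokens):
    """Replace tokens tagged PROPN with a placeholder.

    `tagged_tokens` is a list of (form, upos) pairs, e.g. from a UD parser.
    """
    return [MASK if upos == "PROPN" else form for form, upos in tagged_tokens]


def word_ngrams(words, n_values=(1, 2, 3)):
    """Return all contiguous word n-grams, unigrams first, then longer ones."""
    return [
        " ".join(words[i:i + n])
        for n in n_values
        for i in range(len(words) - n + 1)
    ]


def book_level_folds(book_ids, k=5, seed=0):
    """Assign review indices to k folds so that no book spans two folds."""
    books = sorted(set(book_ids))
    random.Random(seed).shuffle(books)
    fold_of_book = {book: i % k for i, book in enumerate(books)}
    folds = defaultdict(list)
    for index, book in enumerate(book_ids):
        folds[fold_of_book[book]].append(index)
    return [folds[i] for i in range(k)]


def mean_pool(token_vectors):
    """Average token embeddings (equal-length lists of floats) into one vector."""
    if not token_vectors:
        raise ValueError("cannot pool an empty review")
    return [sum(column) / len(token_vectors) for column in zip(*token_vectors)]


# Toy review: three tokens with their UPOS tags.
review = [("Ho", "AUX"), ("amato", "VERB"), ("Hogwarts", "PROPN")]
print(word_ngrams(mask_proper_nouns(review)))
```

Splitting by book rather than by review is what makes the setting harder: a classifier cannot rely on vocabulary tied to a single title that it has already seen during training.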

3.2. Linguistic Features

To model the linguistic properties of the reviews, we relied on a set of 150 linguistic features. These features correspond to specific aspects of the document structure and were derived using Profiling-UD [ 29 ], a web-based tool conceived to linguistically profile multilingual texts by relying on the Universal Dependencies (UD) formalism [ 30 ]. The features encompass 9 dimensions of document structure, which are detailed in Table 2. They range from morpho-syntactic and inflectional properties to more complex aspects of sentence structure, such as the depth of the syntactic tree. Other features pertain to the structure of sub-trees and include the order of subjects and objects in relation to the verb, as well as the use of subordination.
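To make the kind of feature described above more concrete, the following sketch computes a handful of profiling-style values from an already parsed review. It is only an approximation under stated assumptions: the feature names mirror those listed in Appendices B and C, but the formulas here are illustrative guesses and are not taken from Profiling-UD, and the token dictionaries (`form`, `upos`, `head`) are a hypothetical simplification of a CoNLL-U parse.

```python
from collections import Counter


def profile(sentences):
    """Compute a few illustrative linguistic features for one review.

    `sentences` is a list of sentences; each sentence is a list of token
    dicts with keys 'form', 'upos' and 'head' (1-based index of the
    syntactic head within the sentence, 0 for the root).
    """
    tokens = [tok for sent in sentences for tok in sent]
    if not tokens:
        raise ValueError("the review contains no tokens")

    features = {
        "n_tokens": len(tokens),
        "tokens_per_sent": len(tokens) / len(sentences),
    }

    # Percentage distribution of UPOS tags.
    for upos, count in Counter(tok["upos"] for tok in tokens).items():
        features[f"upos_dist_{upos}"] = 100.0 * count / len(tokens)

    # Type/token ratio over the first 100 word forms (assumed definition).
    chunk = [tok["form"].lower() for tok in tokens[:100]]
    features["ttr_form_chunks_100"] = len(set(chunk)) / len(chunk)

    # Average number of direct dependents per verb.
    verb_edges = []
    for sent in sentences:
        dependents = Counter(tok["head"] for tok in sent)
        for position, tok in enumerate(sent, start=1):
            if tok["upos"] == "VERB":
                verb_edges.append(dependents[position])
    features["avg_verb_edges"] = (
        sum(verb_edges) / len(verb_edges) if verb_edges else 0.0
    )
    return features


# Toy parse of "Ho letto il libro" ("I have read the book").
parsed = [[
    {"form": "Ho", "upos": "AUX", "head": 2},
    {"form": "letto", "upos": "VERB", "head": 0},
    {"form": "il", "upos": "DET", "head": 4},
    {"form": "libro", "upos": "NOUN", "head": 2},
]]
print(profile(parsed))
```

A vector of such values, one per review, is the kind of input the Profiling model would consume in place of the words themselves.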

4. Results

Table 3 presents the classification accuracies for the task of Book Author Prediction. Notably, all models outperformed the random uniform baseline on both Amazon and Goodreads. Upon closer examination of the models, we notice that lexical information has more discriminative power than linguistic properties in the task. As proof, consider the global and author-level scores obtained by the Profiling model compared to the Ngrams and, most notably, the BERT models. Interestingly, using the fine-tuned BERT representations as input features for the SVM classifier (SVM (BERT)) yielded lower results than simply using the BERT model directly, and the results are comparable, or lower, when combining contextualized representations with linguistic features (SVM (BERT+Profiling)).

Comparing the two platforms, Goodreads reviews exhibit on average higher accuracy scores overall. This is possibly due to a typical trait of commercial platforms like Amazon, whose reviews frequently encompass aspects beyond the book's content, such as parcel delivery or the edition's book cover. These topics cause the reviews to be quite standardised, thus more difficult to discriminate. Conversely, Goodreads reviews primarily focus on the book's content, possibly containing a larger amount of stylistic elements which help the automatic classification. This trend holds also when classifying individual authors, except Rowling for the Profiling and Ngrams models.

When looking at the results obtained for individual authors, Sarah J. Maas turned out to be the most accurately predicted author on both platforms, considering the average scores across all models. However, upon closer inspection of the results obtained with the top-performing model (BERT), we observe that while Maas remains the most accurately identified author in Amazon reviews, the reviews of Jane Austen's books exhibit the highest level of distinctiveness on Goodreads.

4.1. Discussion

To take a closer look at the classification results, Fig. 1 reports the confusion matrices with the percentage of the predictions made by all models in the Book Author Prediction task. This complements the classification results by showing which authors are more confusing and which are the most wrongly classified ones.

In general, we observe that as the model performance improves, the matrices become less sparse, regardless of the platform. This means that when the correct author is predicted most of the time, the erroneous predictions are distributed quite evenly among all possible authors. Consider, for instance, the matrices obtained from the analysis of BERT and compare them with the matrices referring to the Profiling and Ngrams models, which yield the most sparse matrices.

Notable differences arise in the distribution of predicted authors across the two platforms. For instance, when considering the Profiling model applied to Goodreads reviews, we observe that Maas is the most frequently predicted author, leading to other authors' books being frequently misclassified as Maas's works. Notably, the reviews of It by King and of the fourth book from the Harry Potter saga by Rowling are often incorrectly assigned to Maas. The content of these books, at the crossroads between the fantasy and horror genres, may contribute to the model confusion. However, the most influential factor for the Profiling model predictions appears to be the review length. On Goodreads, reviews of King's and Rowling's books that are longer than 150 tokens are wrongly classified as referring to Maas in over 40% of cases. On Amazon, we observe an opposite tendency, but for a different author: when a review has fewer than 10 tokens, the model assigns the review to Rowling in around 60% of cases.

The analysis of the feature rankings⁸ produced by the classifiers trained on both Amazon and Goodreads reviews confirms the importance of review length for the Profiling model. Indeed, features that capture structural properties are particularly relevant for the model: the use of subordination (subordinate_dist) is crucial for classifying Rowling's and King's reviews on Goodreads, as they exhibit respectively the lowest and highest use of subordinate clauses. Conversely, on Amazon, the average number of verb dependents (verb_edges) and the distribution of function words (namely, conjunctions, auxiliary verbs and determiners) are discriminative for Rowling, Tolkien, and Maas.

As for the Ngrams model, the feature ranking consists of the n-grams employed by the model ordered by relevance for book author classification purposes on Amazon and on Goodreads. Quite expectedly, the analysis of the top 100 most relevant n-grams reveals that, on Amazon, parcel delivery is a highly referenced topic (e.g. 'tempi previsti', expected timing, and 'ben confezionato', well packaged), especially among the readers of Tolkien and Rowling, whose n-gram rankings are the most similar (Spearman correlation score = 0.235, p < 0.05). The two authors are the most frequently confused by the model, especially as regards the reviews of Tolkien's 'The Hobbit' and 'The Silmarillion', wrongly classified as referring to Rowling's books. Indeed, it is possible that the two authors attract a similar readership interested in books that involve intricate mythologies and feature multi-dimensional characters with strengths, flaws, and internal struggles. Such closeness between the Amazon reviews of these authors is captured also by the BERT model which, although performing better than other models on the task, seems quite confused by the reviews of the same Tolkien books.

On Goodreads reviews, where parcel delivery is not relevant, the most impactful n-grams tend to revolve around book appreciation (e.g., 'ho apprezzato', I appreciated; 'lettura piacevole', pleasant reading; 'non mi aspettavo', I did not expect) or plot ('il maghetto', the little wizard; 'signore di', lord of; 'chiesa', church; 'di epoca', historical; 'drago', dragon; 'di vampiri', of vampires). Therefore, it is not surprising to see that King's reviews are most frequently misclassified as referring to Brown's work, also by the BERT model. Both authors, despite their differences, are known for building suspense and tension in their narratives and incorporating detailed historical settings and psychological aspects into their work.

The classification of Goodreads reviews performed by the SVM (BERT) and SVM (BERT + Profiling) models highlights author commonalities that did not emerge so strongly with other models. The reviews of Rowling's books, for instance, are frequently wrongly classified as referring to Maas's work. Both authors are known for their contributions to popular literature, particularly in the genres of fantasy and young adult fiction, which attract a readership interested in exploring themes of personal growth and self-discovery through the characters' coming-of-age journeys.

Overall, no particular author appears to be systematically confused by all models. This finding is particularly interesting from our perspective since it shows that using user-generated reviews as an information source makes it possible to successfully address the Book Author Prediction task. It suggests that books authored by different novelists attract readers who are interested in similar topics and also adopt similar communication strategies in their writing. It also implies that the proposed methodology could have a positive impact on the development of user-centric book recommender systems.

5. Conclusions

This paper has explored an innovative approach that leverages user reviews as a source of information for Book Author Prediction. Building upon our prior work, we introduced a novel dataset of Amazon and Goodreads book reviews, LVC, which has been used for training and evaluating machine learning models addressing the novel Book Author Prediction task.

Our findings highlight the challenging nature of predicting the author of a novel from a reader's review. However, the analysis of erroneous predictions pointed us to cases of books sharing a similar readership. This observation supports the intuition that user-generated reviews can effectively serve as a basis for personalized book recommendations. By analyzing reviews, we gained insights into readers' preferences beyond the writing style of the book's author, opening up new avenues for more tailored and user-centric recommendations.

Moving forward, this research could be expanded by investigating the impact of exploiting user judgments as an additional feature for classification. Furthermore, the sentiment expressed by readers about a book, whether positive or negative, could be leveraged to validate and fine-tune personalized recommendations.

⁸ See Appendix B and C.

Acknowledgments

We thank the "Letture pER TE" (LettERE) project (2022-2024) funded by Regione Toscana (Progetti Congiunti di Alta Formazione - POR FSE 2014-2020 Investimenti a favore della crescita e dell'occupazione) in collaboration with M.E.T.A. Srl company.

A. Books of the Literary Voices Corpus

Author

B. Feature ranking Profiling Model (Goodreads)

Feature: ttr_lemma_chunks_100, ttr_form_chunks_100, aux_tense_dist_Pres, ttr_form_chunks_200, ttr_lemma_chunks_200, n_prepositional_chains, n_tokens, upos_dist_AUX, upos_dist_ADP, dep_dist_orphan, upos_dist_DET, aux_mood_dist_Ind, dep_dist_aux, aux_tense_dist_Imp, dep_dist_case, dep_dist_cop, dep_dist_mark, verbs_form_dist_Part, dep_dist_flat:name, aux_num_pers_dist_Sing+3

Stephen King

Feature: upos_dist_CCONJ, dep_dist_cc, avg_prepositional_chain_len, ttr_form_chunks_200, ttr_lemma_chunks_200, prep_dist_2, prep_dist_1, subordinate_post, dep_dist_orphan, prep_dist_3, subordinate_dist_1, tokens_per_sent, n_tokens, aux_tense_dist_Pres, ttr_lemma_chunks_100, avg_verb_edges, subordinate_pre, verbal_head_per_sent, dep_dist_case, upos_dist_ADP

C. Feature ranking Profiling Model (Amazon)

Feature: upos_dist_AUX, dep_dist_det, dep_dist_aux, upos_dist_ADV, ttr_lemma_chunks_100, ttr_form_chunks_100, dep_dist_cop, upos_dist_DET, dep_dist_root, dep_dist_advmod, verb_edges_dist_2, verb_edges_dist_3, ttr_lemma_chunks_200, aux_tense_dist_Pres, verb_edges_dist_4, avg_verb_edges, verbs_form_dist_Part, dep_dist_case, ttr_form_chunks_200, verb_edges_dist_1

Stephen King

Feature: dep_dist_det, dep_dist_cc, upos_dist_CCONJ, upos_dist_DET, ttr_form_chunks_200, ttr_lemma_chunks_200, avg_verb_edges, verbs_form_dist_Part, aux_tense_dist_Pres, verbs_form_dist_Fin, verbs_form_dist_Inf, lexical_density, ttr_lemma_chunks_100, verb_edges_dist_1, dep_dist_root, upos_dist_AUX, dep_dist_aux, principal_proposition_dist, dep_dist_det:poss, dep_dist_flat:foreign
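Section 4.1 compares the n-gram rankings of two authors with a Spearman correlation, and the same statistic could in principle be applied to feature rankings such as those listed in Appendices B and C. The sketch below is a minimal illustration, assuming two rankings over exactly the same items with no ties; how rankings with only partially overlapping items were handled in the paper is not stated, so the function name and this restriction are assumptions.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation between two orderings of the same items.

    Uses the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is
    valid only when there are no ties.
    """
    if len(set(ranking_a)) != len(ranking_a) or len(set(ranking_b)) != len(ranking_b):
        raise ValueError("rankings must not contain repeated items")
    if set(ranking_a) != set(ranking_b):
        raise ValueError("both rankings must contain the same items")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("at least two items are required")

    position_in_b = {item: rank for rank, item in enumerate(ranking_b)}
    squared_diffs = sum(
        (rank - position_in_b[item]) ** 2 for rank, item in enumerate(ranking_a)
    )
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


# Two hypothetical orderings of the same four feature names.
first = ["n_tokens", "upos_dist_AUX", "dep_dist_case", "ttr_form_chunks_200"]
second = ["upos_dist_AUX", "n_tokens", "dep_dist_case", "ttr_form_chunks_200"]
print(spearman_rho(first, second))
```

Values close to 1 indicate that the two rankings order the items almost identically, while values near 0 indicate little agreement.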

References

[1] Alharthi, Inkpen, Szpakowicz, Authorship identification for literary book recommendations, in: Proceedings of the 27th International Conference on Computational Linguistics (COLING), ACL, 2018, pp. 390-400.

[2] Alharthi, Inkpen, Szpakowicz, A survey of book recommender systems, Journal of Intelligent Information Systems 51 (2018) 139-160.

[3] J. B. Schafer, Konstan, Riedl, Recommender systems in e-commerce, in: Proceedings of the 1st ACM conference on Electronic commerce, 1999, pp. 158-166.

[4] J.-M. Schaeffer, Qu'est-ce qu'un genre littéraire?, Seuil, 1989.

[5] Biber, Conrad, Genre, Register, Style, Cambridge University Press, 2009.

[6] Shamir, UDAT: Compound quantitative analysis of text using machine learning, Digital Scholarship in the Humanities 36 (2020) 187-208.

[7] Rahul, Ayush, D. Agarwal, D. Vijay, Genre classification using character networks, in: Proceedings of the 5th International Conference on Intelligent Computing and Control Systems (ICICCS), IEEE, 2021, pp. 216-222.

[8] Worsham, Kalita, Genre identification and the compositional effect of genre in literature, in: Proceedings of the 27th International Conference on Computational Linguistics (COLING), ACL, 2018, pp. 1963-1973.

[9] Ozsarfati, Sahin, C. J. Saul, Yilmaz, Book genre classification based on titles with comparative machine learning algorithms, in: Proceedings of 2019 4th International Conference on Computer and Communication Systems (ICCCS), IEEE, 2019, pp. 14-20.

[10] Buczkowski, Sobkowicz, Kozlowski, Deep learning approaches towards book covers classification, in: Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods (ICPRAM), SCITEPRESS-Science and Technology Publications, 2018, pp. 309-316.

[11] Wang, Liu, Y. Han, Exploring Goodreads reviews for book impact assessment, Journal of Informetrics 13 (2019) 874-886.

[12] Aerts, Smits, P. W. Verlegh, How online consumer reviews are influenced by the language and valence of prior reviews: A construal level perspective, Computers in Human Behavior 75 (2017) 855-864.

[13] S. K. Maity, Panigrahi, Mukherjee, Analyzing social book reading behavior on Goodreads and how it predicts Amazon best sellers, Influence and Behavior Analysis in Social Networks and Social Media (2019) 211-235.

[14] Dimitrov, Zamal, Piper, Ruths, Goodreads versus Amazon: the effect of decoupling book reviewing and book selling, in: Proceedings of International AAAI Conference on Web and Social Media (ICWSM), volume 9, 2015, pp. 602-605.

[15] Thelwall, Reader and author gender and genre in Goodreads, Journal of Librarianship and Information Science 51 (2019) 403-430.

[16] Saraswat, Leveraging genre classification with RNN for book recommendation, International Journal of Information Technology (2022) 1-6.

[17] Scofield, M. O. Silva, L. de Melo-Gomes, M. M. Moro, Book genre classification based on reviews of Portuguese-language literature, in: Proceedings of the International Conference on Computational Processing of the Portuguese Language (PROPOR), 2022, pp. 188-197.

[18] Alzetta, Dell'Orletta, Miaschi, E. Prat, G. Venturi, Tell me how you write and I'll tell you what you read: a study on the writing style of book reviews, Journal of Documentation Forthcoming (2023).

[19] Stamatatos, A survey of modern authorship attribution methods, Journal of the American Society for Information Science and Technology 60 (2009) 538-556.

[20] Bourrier, Thelwall, The social lives of books: Reading Victorian literature on Goodreads, Journal of Cultural Analytics 5 (2020) 12049.

[21] Driscoll, D. Rehberg Sedo, Faraway, so close: Seeing the intimacy in Goodreads reviews, Qualitative Inquiry 25 (2019) 248-259.

[22] Nuttall, Harrison, Wolfing down the Twilight series: Metaphors for reading in online reviews, Contemporary media stylistics (2020) 35-60.

[23] Kaur, Singh, Impact of online consumer reviews on Amazon books sales: Empirical evidence from India, Journal of Theoretical and Applied Electronic Commerce Research 16 (2021) 2793-2807.

[24] Chiavetta, G. L. Bosco, Pilato, A lexicon-based approach for sentiment classification of Amazon books reviews in Italian language, in: International Conference on Web Information Systems and Technologies (WEBIST), volume 3, Scitepress, 2016, pp. 159-170.

[25] Srujan, Nikhil, H. Raghav Rao, Karthik, Harish, H. Keerthi Kumar, Classification of Amazon book reviews based on sentiment analysis, in: Information Systems Design and Intelligent Applications, Springer, 2018, pp. 401-411.

[26] Vasudevan, A. John, A review on text sanitization, International Journal of Computer Applications 95 (2014).

[27] Wolf, Debut, V. Sanh, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), ACL, 2020, pp. 38-45.

[28] Tiedemann, Nygaard, The OPUS corpus - parallel and free, in: Proceedings of the Conference on Language Resources and Evaluation (LREC), ELRA, 2004.

[29] Brunato, Cimino, Dell'Orletta, Venturi, Montemagni, Profiling-UD: a tool for linguistic profiling of texts, in: Proceedings of the Conference on Language Resources and Evaluation (LREC), ELRA, 2020, pp. 7147-7153.

[30] M. C. De Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal dependencies, Computational Linguistics 47 (2021) 255-308.