<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hits or Misses? A Linguistically Explainable Formula for Fanfiction Success</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giulio Leonardi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dominique Brunato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Istituto di Linguistica Computazionale “Antonio Zampolli”, ItaliaNLP Lab</institution>
          ,
          <addr-line>Pisa</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Pisa</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>This study presents a computational analysis of Italian fanfiction, aiming to construct an interpretable model of successful writing within this emerging literary domain. Leveraging explicit features that capture both linguistic style and semantic content, we demonstrate the feasibility of automatically predicting successful writing in fanfiction, and we identify a set of robust linguistic predictors that maintain their predictive power across diverse topics and time periods, offering insights into the universal aspects of engaging storytelling. This approach not only enhances our understanding of fanfiction as a genre but also offers potential applications in broader literary analysis and content creation.</p>
      </abstract>
      <kwd-group>
<kwd>fanfiction</kwd>
        <kwd>Italian corpus</kwd>
        <kwd>success prediction</kwd>
        <kwd>linguistic features</kwd>
        <kwd>Explainable Boosting Machine</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
<p>The growing proliferation of online literary content has led to the emergence of new genres and storytelling forms, with fanfiction being particularly popular among teens and young adults. Fanfiction consists of stories created by fans (mostly hobby authors) that extend or alter the narrative of existing popular media such as books, movies, comics or games, and represents a significant portion of user-generated content on the web [<xref ref-type="bibr" rid="ref1">1</xref>]. In recent years, the widespread popularity that this genre has assumed has prompted research into the linguistic and stylistic elements that contribute to its success, mirroring studies conducted on more traditional literary genres [<xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>], among others.</p>
      <p>Understanding the elements that contribute to narrative success is a fascinating area of research with implications across various fields, from literary analysis to digital humanities. From a socio-linguistic perspective, it can offer deeper insights into people and culture. It also has significant applications in areas such as personalized content recommendation and educational technology [<xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>]. While personal interests undoubtedly play a crucial role in predicting a reader's engagement with literary content, the way information is presented can also evoke different reactions and levels of interaction, ultimately influencing the narrative's success. In this regard, recent advancements in Natural Language Processing (NLP) and machine learning offer a powerful lens for making explicit the patterns that may explain the complex interplay between reader engagement and content success.</p>
      <p>This paper moves in this field and presents a computational analysis focused on Italian fanfiction, addressing the following research questions: i.) Can the success of Italian fanfiction be automatically predicted using stylistic and lexical features of the texts?; ii.) Which types of features demonstrate the highest predictive capability, and how consistent are these features across different time periods and thematic domains?; iii.) To what extent can these features be explained in terms of their contribution to predicting success?</p>
      <p>Our contributions. i.) We collected a corpus of Italian fanfiction stories enriched with metadata considered as proxies of their success; ii.) We investigated the relationship between stylistic and lexical features of stories and their success from a modeling perspective; iii.) We identified the most influential features in success prediction, showing the key role played by form- and style-related features across time and thematic domains of fanfiction.</p>
      <p>The paper is structured as follows: Section 2 briefly contextualizes our study among relevant literature; Section 3 presents the reference corpus of Italian fanfiction stories that we collected; in Section 4 we provide an overview of the approach we devised, including the description of the features used for classification and the classifiers employed. Section 5 discusses the main findings and offers a fine-grained analysis of the classification results in terms of feature explainability. In Section 6 we summarize the key findings and outline promising directions for future research in this field.</p>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy. *Corresponding author. † These authors contributed equally. g.leonardi5@studenti.unipi.it (G. Leonardi); dominique.brunato@ilc.cnr.it (D. Brunato); felice.dellorletta@ilc.cnr.it (F. Dell'Orletta). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
<sec id="sec-2">
      <title>2. Related Work</title>
      <p>The exploration of online content and its engagement levels has increasingly benefited from advancements in NLP and machine learning. Different perspectives have been touched upon, considering different textual domains, typologies of linguistic features, and quantitative metrics to operationalize a very subjective concept like success. The study by Toubia and colleagues [<xref ref-type="bibr" rid="ref7">7</xref>] explores how the structure of narratives, particularly the internal semantic progression measured by features derived from dense word representations, affects the success of stories across different text typologies (movies, TV shows, and academic papers). Berger and colleagues [<xref ref-type="bibr" rid="ref8">8</xref>] examine how the linguistic structure of online content affects user engagement, specifically by modeling sustained attention. This concept goes beyond just attracting a reader with a catchy headline or advertisement; it also encompasses the likelihood that a reader will continue viewing or reading the content. In their analysis of more than 35,000 online contents from heterogeneous sources, they emphasize the role of features related to processing ease and emotional language.</p>
      <p>In the realm of literary works, Ashok et al. [<xref ref-type="bibr" rid="ref2">2</xref>] first leveraged stylometric analysis and machine learning techniques to predict the success of popular English novels from the Gutenberg Project. Their approach demonstrated the potential of these techniques for assessing literary success. Extending these findings, Maharjan et al. [<xref ref-type="bibr" rid="ref9">9</xref>] proposed a multi-task approach to simultaneously evaluating success and genre prediction. Using deep learning representations, in addition to hand-crafted features related to the topic, sentiment, writing style, and readability of books, they obtained better performance than the single-task success prediction approach. Focusing on contemporary English-language literature, the study by Bizzoni and colleagues [<xref ref-type="bibr" rid="ref10">10</xref>] investigates how perceived novel quality is influenced by a broad spectrum of textual features, such as those related to readability and sentiment, and how these perceptions vary depending on the reader's level of expertise.</p>
      <p>The growing volume of online fanfiction has also been the subject of numerous studies, either from the perspective of text mining using NLP or through a qualitative lens via manual examination. A comprehensive survey of analyses in this direction has recently been provided by [<xref ref-type="bibr" rid="ref11">11</xref>]. For example, Milli and Bamman [<xref ref-type="bibr" rid="ref12">12</xref>] explore the relationship between fanfiction and its original canon, offering one of the first empirical analyses of this genre. Similarly, Sourati et al. [<xref ref-type="bibr" rid="ref13">13</xref>] find that the similarity between fanfictions and their original stories, particularly in terms of emotional arcs and character dynamics, correlates significantly with fanfiction's popularity.</p>
      <p>In the context of Italian fanfiction, research using NLP techniques is still limited. Mattei et al. [<xref ref-type="bibr" rid="ref14">14</xref>] employ linguistic profiling to analyze a corpus of Italian fanfiction inspired by the Harry Potter series, with the purpose of identifying linguistic patterns associated with success. Inspired by this previous study, our research aims to extend these findings through a computational modeling approach, investigating the power of linguistic features for predicting fanfiction success and their generalization across different experimental settings.</p>
    </sec>
    <sec id="sec-2-1">
      <title>3. Corpus Construction</title>
      <p>As a first step, we compiled a reference corpus of Italian fanfiction. To this end, we searched available texts on efpfanfic.net, one of the largest Italian websites dedicated to publishing and reading amateur stories, focusing specifically on stories labeled with the fanfiction genre. Using a web scraping system, we extracted fanfictions based on the Harry Potter series, a highly popular fandom on the site, boasting 57,196 stories published between 2003 and 2023. Figure 1 presents the temporal distribution of these fanfictions up to 2020.</p>
      <p>Additionally, we gathered a secondary corpus consisting of 2,441 stories based on The Lord of the Rings series. This secondary corpus served as a test set to assess the influence of thematic domains on the analysis of story success.</p>
      <p>For this study, we focused on the first chapter of each fanfiction to ensure a consistent analysis. While it is widely recognized that thematic units within stories, particularly the beginnings and endings, often differ from the middle sections due to their distinct narrative roles, we observed that the majority of stories (69%) consist of only a single chapter, making them effectively self-contained.</p>
      <p>The efpfanfic portal allows users to review each chapter, with ratings marked as negative, neutral, or positive. Consistent with prior research such as [<xref ref-type="bibr" rid="ref9">9</xref>], we used the absolute number of reviews to define the success of a story, which we consider broadly as popularity. This approach is based on the assumption that a high number of interactions, regardless of their sentiment, reflects strong reader engagement. This is especially confirmed since in our dataset negative reviews represent less than 1% of the total.</p>
      <p>To formulate our success prediction task, we established a review threshold to classify each story as either a success or a failure. After analyzing the distribution of reviews for Harry Potter texts (Figure 2), we decided to exclude stories that fell in the middle of the distribution, those that could not be clearly defined as successes or failures. Consequently, stories with fewer than two reviews (25th percentile) were classified as failures, and those with more than six reviews (75th percentile) as successes. Stories within the interquartile range were excluded from the analysis. We also excluded texts published after 2020, considering them too recent for meaningful comparison.</p>
      <p>As summarized in Table 1, the final corpora, hereafter abbreviated as HP (Harry Potter) and LOTR (The Lord of the Rings), consist of 26,032 and 932 texts, respectively.</p>
    </sec>
    <sec id="sec-2-1-1">
      <title>4.1. Success Predictors</title>
      <p>A comprehensive set of features was extracted for each story in the corpus. These features were categorized into two primary groups: linguistic features, reflecting the text's linguistic style and structure, and lexical features, representing the semantic content of the text.</p>
      <sec id="sec-2-1-2">
        <title>4.1.1. Linguistic Features</title>
        <p>To model the text's linguistic style and structure, we drew inspiration from the linguistic profiling framework, an NLP-based methodology in which a large set of linguistically motivated features automatically extracted from annotated texts is used to obtain a vector-based representation of each text. Such representations can then be compared across texts representative of different textual genres and varieties to identify the peculiarities of each [<xref ref-type="bibr" rid="ref15">15</xref>]. For our study, we relied on Profiling-UD (http://linguistic-profiling.italianlp.it/), a multilingual tool inspired by this framework, which extracts over 130 linguistic features from texts using the Universal Dependencies (UD) annotation formalism. As described in Brunato et al. [<xref ref-type="bibr" rid="ref16">16</xref>], these features encompass a range of linguistic phenomena that can be classified into distinct groups covering, e.g., shallow text features (e.g. document and sentence length, average word length), the distribution of grammatical categories, inflectional morphology, and syntactic properties related to local and global parse tree depth structure.</p>
      </sec>
    </sec>
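<p>The labeling scheme described above can be sketched in a few lines. This is our own illustrative code, not the authors' implementation; the function and constant names are invented, while the thresholds (fewer than two reviews as failure, more than six as success, interquartile range excluded) are the ones reported in the text.</p>
<preformat>
```python
# Label stories by review count, following the thresholds reported in
# Section 3: fewer than 2 reviews (25th percentile) -> failure,
# more than 6 reviews (75th percentile) -> success; stories in the
# interquartile range are excluded from the analysis.
# Names (label_story, build_dataset, SUCCESS_MIN, FAILURE_BELOW) are ours.

FAILURE_BELOW = 2   # strictly below this count -> failure
SUCCESS_MIN = 6     # strictly above this count -> success

def label_story(n_reviews):
    """Return 'success', 'failure', or None (excluded)."""
    if n_reviews > SUCCESS_MIN:
        return "success"
    if n_reviews >= FAILURE_BELOW:
        return None  # interquartile range: excluded from the analysis
    return "failure"

def build_dataset(review_counts):
    """Keep only the stories that are clearly successes or failures."""
    labeled = [(n, label_story(n)) for n in review_counts]
    return [(n, lab) for n, lab in labeled if lab is not None]
```
</preformat>
<p>For example, a story with one review would be labeled a failure, one with ten reviews a success, and one with four reviews would be dropped.</p>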
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>Based on the newly collected dataset and its internal
distinction, we formulated the task of success prediction
as a binary classification problem, that is: given a story,
the model is asked to predict whether it belongs to the
successful or unsuccessful class, where the two classes
were defined according to the metric based on the number
of reviews received by readers.</p>
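<p>As a concrete illustration of the lexical representation described in Section 4.1.2, the following sketch computes relative frequencies of token n-grams and of word-initial/word-final character sequences. It is a simplified sketch of ours, not the authors' code; lemma n-grams would be computed in the same way from lemmatized tokens.</p>
<preformat>
```python
from collections import Counter

def lexical_features(tokens, max_n=3, max_chars=4):
    """Relative frequencies of token n-grams (n = 1..3) and of character
    sequences at the beginning or end of words (length 1..4), mirroring
    the Forms and Characters groups of the Lexical Model (Sec. 4.1.2)."""
    counts = Counter()
    # Forms: unigrams, bigrams, trigrams of tokens.
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[("form", tuple(tokens[i:i + n]))] += 1
    # Characters: prefixes and suffixes of each word, length 1..4.
    for tok in tokens:
        for k in range(1, max_chars + 1):
            if len(tok) >= k:
                counts[("prefix", tok[:k])] += 1
                counts[("suffix", tok[-k:])] += 1
    total = sum(counts.values())
    return {feat: c / total for feat, c in counts.items()}
```
</preformat>
<p>Calling lexical_features(["la", "storia"]) would, for instance, produce the bigram feature ("la", "storia") together with prefixes such as "st" and suffixes such as "ia", with frequencies summing to one.</p>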
<p>These features have proven effective in tasks related to modeling text form, such as assessing text complexity and identifying the stylistic traits of authors or author groups. In line with our main purpose to construct a model of success, and building on previous research on a similar corpus of fanfiction [<xref ref-type="bibr" rid="ref14">14</xref>], we hypothesize that these features can also distinguish between successful and unsuccessful fanfictions from a modeling perspective.</p>
      <sec id="sec-4-1-2">
        <title>4.1.2. Lexical Features</title>
        <p>The second representation employed is based on lexical information and leverages the relative frequency of n-grams in each document. The choice of n-grams, in contrast to more powerful semantic representations derived from embeddings, is deliberately motivated by the desire to use lexical features that remain completely explicit. The model, henceforth referred to as the Lexical Model, consists of the following features:</p>
        <list list-type="bullet">
          <list-item><p>Forms: unigrams, bigrams, and trigrams of tokens.</p></list-item>
          <list-item><p>Lemmas: unigrams, bigrams, and trigrams of lemmas.</p></list-item>
          <list-item><p>Characters: sequences of characters at the beginning or end of words, ranging from 1 to 4 characters in length.</p></list-item>
        </list>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Classifiers</title>
        <p>In line with our research questions, the explainability of the classification is crucial to evaluate the impact of linguistic and lexical features on the prediction of success. Therefore, two classification algorithms that allow for a precise global explanation of the predictions were selected.</p>
        <p>The first classifier employed is a linear Support Vector Machine (SVM). By fitting a decision hyperplane in the feature space, this method enables the examination of the hyperplane's coefficients to assess the importance of the features.</p>
        <p>The second algorithm employed is the Explainable Boosting Machine (EBM), which belongs to the family of Generalized Additive Models (GAMs). As explained in [<xref ref-type="bibr" rid="ref17">17</xref>], a GAM is a model of the form: g(E[y]) = β0 + Σj fj(xj) (1), where g(.) is called the link function, used to model the output (e.g., the logistic function for classification). Each fj(.) is referred to as a shape function, which is a univariate function modeling the relationship between the feature xj and the target.</p>
        <p>The prediction is thus a sum of non-linear and arbitrarily complex shape functions, generally resulting in better accuracy compared to linear models. Additionally, with a reasonable number of features, the model remains explainable. Each shape function can be visualized as a two-dimensional plot, with the feature value on the x-axis and the score assigned by the shape function on the y-axis. A score greater than 0 indicates a contribution towards the positive class, whereas a score less than 0 indicates a contribution towards the negative class. The final prediction value for a record is simply the sum of the scores obtained from each shape function, potentially transformed by the link function. Beyond analyzing individual shape functions, the average contribution of each feature can be evaluated by taking the mean of the absolute values of the assigned scores.</p>
        <p>There are various algorithms within the family of GAMs, primarily distinguished by the method used to fit the shape functions. In the case of the EBM, standard gradient boosting is used. However, in each boosting iteration, the algorithm sequentially cycles through each feature, constructing each univariate shape function through bagged boosted trees. This method has proven to be one of the most effective for training a GAM.</p>
        <p>For our study, the EBM was employed exclusively for experiments based on linguistic features, due to the excessive dimensionality of the lexical model. This high dimensionality would have rendered the GAM too complex to interpret and too time-expensive to train.</p>
      </sec>
      <sec id="sec-5">
        <title>5. Results and Discussion</title>
        <p>The classification results are summarized in Table 2 for each model and scenario under evaluation. Table 2 reports the classification accuracy (%) of the models, where 'Ling.' and 'Lex.' refer respectively to models trained on linguistic and lexical features, and the baseline corresponds to the majority class label; rows cover the in-domain, out-domain, and average cross-time scenarios, together with the overall average.</p>
        <p>For models using linguistic features, in the in-domain scenario both the SVM and the EBM outperform the majority class baseline, with accuracies of 65.03% and 66.15% respectively, compared to 50.16% for the baseline. This indicates that both classifiers are effectively capturing the linguistic patterns associated with success within the same thematic domain.</p>
        <p>For linguistic models, in the out-domain scenario the performance of the SVM drops significantly, with an accuracy of 59.22%, whereas the EBM experiences a less drastic decline, achieving an accuracy of 64.70%. However, both classifiers still perform better than the baseline, suggesting some degree of ability of the linguistic features to generalize across different thematic domains.</p>
        <p>The lexical model, in the in-domain scenario, achieves an accuracy of 69.56%, outperforming all models with linguistic features and suggesting that lexical features provide a more powerful representation for in-domain success prediction. Nevertheless, in the out-domain scenario, the lexical model does not surpass the baseline, indicating a complete lack of predictive ability. This suggests that lexical features, which are primarily based on the content of the specific fanfiction's narrative universe, perform well within the same thematic domain but lose all significance outside of it. Conversely, linguistic features, which focus on the form of the text, appear to be more adaptable regardless of the theme.</p>
        <p>Figure 3 presents the performance over time for classifiers trained with linguistic features (classification accuracy in the cross-time setting). Additionally, two baselines are shown: "Random Choice", which randomly selects between the two classes, and "Maj. Class", which always assigns the majority class from the corresponding training set (2011 stories), i.e. the positive one. The results of the lexical model in the cross-time scenario were insignificant, as they were very similar to the "Maj. Class" baseline. The classifier, therefore, defaults to assigning the negative class, demonstrating no predictive capability. To avoid confusion, the lexical model results are not included in this figure. In contrast, the cross-time results for models using linguistic features are more meaningful: the results remain stable around an average of 62%, regardless of the dominant class in the tested year and the classifier used (avg. cross-time in Table 2).</p>
        <p>The cross-time scenario further suggests that linguistic features possess greater adaptability beyond their own domain, maintaining a considerable degree of generalization over time. Conversely, lexical features seem functional only within the specific domain of the training set, losing all predictive power for texts from different domains. Overall, the model that performed best on average across the three scenarios, and with the least variance in performance, is the EBM trained with linguistic features. We provide an in-depth analysis of this model in the following section.</p>
        <sec id="sec-5-1">
          <title>5.1. The Model of Success</title>
          <p>To gain a better understanding of the classification results and identify the most influential features for predicting success, we ranked the features according to the absolute value of their weight in the EBM classifier model trained on the entire training set. Table 3 presents an extract of the top 15 features. The analysis reveals that, in addition to basic text features such as the average document length (measured in tokens) and the average word length (in characters), more complex linguistic properties play a crucial role. Among these, features related to verbal predicates and verbal morphology emerge as particularly influential. This suggests that the syntactic and morphological characteristics of verbs, such as tense, mood and person, provide valuable information for the classifier prediction, highlighting the importance of deeper linguistic structures in building a model of successful writing.</p>
          <p>While this ranking highlights the 'global' importance of features, it does not explain their effect on classification. For a more detailed analysis, Figure 4 in Appendix A highlights the threshold values for each of the top 15 ranked features, indicating the point at which the expected classification shifts from one class to the other. Additionally, it provides the number of instances in the training set for each feature value. Interestingly, there are some features which split almost exactly the amount of data into two subsets. For example, the feature representing word length (char_per_tok) has a discriminant threshold of 4.55 characters, which distinguishes successful stories, typically with longer words, from unsuccessful ones, usually with shorter words. Similarly, features related to the (morpho-)syntactic profile of the text, such as the percentage of conjunctions (dep_dist_conj) and non-finite verb forms (verbs_form_dist_Fin), show a similar pattern. For these features, values lower than the discriminant threshold contribute to predicting the negative class, effectively splitting the data into two groups with comparable densities. Regarding verb presence (verbal_head_per_sentence), an increased use of verbs correlates with the unsuccessful class. This finding contradicts the idea that higher readability, typically conveyed by a predominantly verbal prose rather than a nominal one, is a good indicator of writing quality. However, it aligns with observations by Ashok et al. [<xref ref-type="bibr" rid="ref2">2</xref>], who identified similar patterns in canonical literary novels.</p>
          <p>Features related to verbal morphology also show a peculiar trend. For instance, a complementary perspective emerges concerning the use of person morphology. Increasing the use of the second person plural beyond a relatively low threshold (0.4) positively affects the prediction of success, which may indicate an alignment with the Reader-Insert format (https://fanlore.org/wiki/Reader-Insert), a specific type of fanfiction where the reader assumes the role of the protagonist, heavily relying on second-person narration. In contrast, an excessive use of the first person plural is associated with the negative class.</p>
        </sec>
      </sec>
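<p>To make Eq. (1) and the score-based reading of the shape functions concrete, the toy sketch below implements a GAM-style classifier with a logistic link. The shape functions and their score magnitudes are invented for illustration (only the char_per_tok threshold of 4.55 and the direction of the verbal_head_per_sentence effect mirror Section 5.1); it is not the fitted EBM.</p>
<preformat>
```python
import math

# Toy GAM-style scorer in the spirit of Eq. (1): g(E[y]) = b0 + sum_j f_j(x_j).
# Positive scores push towards the "success" class, negative towards "failure".
# The step functions and magnitudes are invented for illustration; only the
# 4.55 char_per_tok threshold mirrors the discriminant value in Sec. 5.1.

def f_char_per_tok(x):
    return 0.8 if x > 4.55 else -0.8   # longer words -> success

def f_verbal_heads(x):
    return -0.5 if x > 2.0 else 0.3    # more verbs -> unsuccessful

SHAPE_FUNCTIONS = {"char_per_tok": f_char_per_tok,
                   "verbal_head_per_sentence": f_verbal_heads}
INTERCEPT = 0.0  # b0

def predict_proba(features):
    """Sum the per-feature scores and apply the logistic link."""
    score = INTERCEPT + sum(f(features[name])
                            for name, f in SHAPE_FUNCTIONS.items())
    return 1.0 / (1.0 + math.exp(-score))

def predict(features):
    return "success" if predict_proba(features) > 0.5 else "failure"
```
</preformat>
<p>The global importance ranking used in Section 5.1 corresponds to averaging the absolute values of the scores each shape function assigns over the training data.</p>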
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <p>Understanding success factors in literary writing is an
evolving area of cross-disciplinary research. This study
on Italian fanfiction demonstrated the feasibility of
predicting success using computational methods and
explainability techniques. Notably, we found that features
related to style and structure of texts show greater
robustness than lexical ones across different domains and
time periods. This suggests that the way a story is crafted
may be more universally appealing than specific word
choices or thematic elements.</p>
      <p>We believe that the implications of this study extend far beyond fanfiction research. On the one hand, it provides new methodologies for analyzing online literary phenomena, offering potential contributions to the digital humanities. On the other, from an NLP perspective, it could inform text generation models, potentially guiding the creation of content that resonates more effectively with readers.</p>
      <p>Future research could explore the generalizability of these findings to other languages and genres, as well as the dynamics of evolving reader preferences over time, also considering alternative measures to gauge success. Additionally, this study does not take into account the importance of the author; a potential future development would be to consider the impact of the author's popularity and productivity on the success of their fanfiction.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hellekson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Busse</surname>
          </string-name>
          ,
          <article-title>Fan fiction and fan communities in the age of the internet: new essays</article-title>
          ,
          <source>McFarland</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V. G.</given-names>
            <surname>Ashok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Success with style: Using writing style to predict the success of novels</article-title>
          ,
          <source>in: Proceedings of the 2013 conference on empirical methods in natural language processing</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1753</fpage>
          -
          <lpage>1764</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Brottrager</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Arslan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Brandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Weitin</surname>
          </string-name>
          ,
          <article-title>Modeling and predicting literary reception. a data-rich approach to literary historical reception</article-title>
          ,
          <source>Journal of Computational Literary Studies</source>
          <volume>1</volume>
          (
          <year>2022</year>
          ). URL: https://doi.org/10.48694/jcls.95.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Algee-Hewitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Allison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gemma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Heuser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Moretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Walser</surname>
          </string-name>
          ,
          <article-title>Canon/archive: large-scale dynamics in the literary field</article-title>
          ,
          <year>2018</year>
          . URL: https://litlab.stanford.edu/LiteraryLabPamphlet11.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <article-title>Reviews matter: How distributed mentoring predicts lexical diversity on fanfiction.net</article-title>
          ,
          <year>2018</year>
          . URL: https://api.semanticscholar.org/CorpusID:265096028.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sauro</surname>
          </string-name>
          ,
          <article-title>Fan fiction and informal language learning</article-title>
          ,
          <source>The handbook of informal language learning</source>
          (
          <year>2019</year>
          )
          <fpage>139</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>O.</given-names>
            <surname>Toubia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Berger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eliashberg</surname>
          </string-name>
          ,
          <article-title>How quantifying the shape of stories predicts their success</article-title>
          ,
          <source>Proceedings of the National Academy of Sciences of the United States of America</source>
          <volume>118</volume>
          (
          <year>2021</year>
          ). URL: https://api.semanticscholar.org/CorpusID:235648521.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Berger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Moe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Schweidel</surname>
          </string-name>
          ,
          <article-title>What holds attention? linguistic drivers of engagement</article-title>
          ,
          <source>Journal of Marketing</source>
          <volume>87</volume>
          (
          <year>2023</year>
          )
          <fpage>793</fpage>
          -
          <lpage>809</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:255250393.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Maharjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Arevalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Solorio</surname>
          </string-name>
          ,
          <article-title>A multi-task approach to predict likability of books</article-title>
          ,
          <source>in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1217</fpage>
          -
          <lpage>1227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bizzoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Moreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. M. S.</given-names>
            <surname>Lassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Thomsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nielbo</surname>
          </string-name>
          ,
          <article-title>A matter of perspective: Building a multi-perspective annotated dataset for the study of literary quality</article-title>
          , in:
          <string-name><given-names>N.</given-names> <surname>Calzolari</surname></string-name>
          ,
          <string-name><given-names>M.-Y.</given-names> <surname>Kan</surname></string-name>
          ,
          <string-name><given-names>V.</given-names> <surname>Hoste</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Lenci</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Sakti</surname></string-name>
          ,
          <string-name><given-names>N.</given-names> <surname>Xue</surname></string-name>
          (Eds.),
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</source>
          , ELRA and ICCL, Torino, Italia,
          <year>2024</year>
          , pp.
          <fpage>789</fpage>
          -
          <lpage>800</lpage>
          . URL: https://aclanthology.org/2024.lrec-main.71.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zigmond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Glassco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Giabbanelli</surname>
          </string-name>
          ,
          <article-title>Big data meets storytelling: using machine learning to predict popular fanfiction</article-title>
          ,
          <source>Social Network Analysis and Mining</source>
          <volume>14</volume>
          (
          <year>2024</year>
          )
          <fpage>58</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Milli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bamman</surname>
          </string-name>
          ,
          <article-title>Beyond canonical texts: A computational analysis of fanfiction</article-title>
          , in:
          <string-name><given-names>J.</given-names> <surname>Su</surname></string-name>
          ,
          <string-name><given-names>K.</given-names> <surname>Duh</surname></string-name>
          ,
          <string-name><given-names>X.</given-names> <surname>Carreras</surname></string-name>
          (Eds.),
          <source>Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Austin, Texas,
          <year>2016</year>
          , pp.
          <fpage>2048</fpage>
          -
          <lpage>2053</lpage>
          . URL: https://aclanthology.org/D16-1218. doi:10.18653/v1/D16-1218.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sourati Hassan Zadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sabri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bahrak</surname>
          </string-name>
          ,
          <article-title>Quantitative analysis of fanfictions' popularity</article-title>
          ,
          <source>Social Network Analysis and Mining</source>
          <volume>12</volume>
          (
          <year>2022</year>
          )
          <fpage>42</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mattei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <article-title>The style of a successful story: a computational study on the fanfiction genre</article-title>
          , in:
          <string-name><given-names>J.</given-names> <surname>Monti</surname></string-name>
          ,
          <string-name><given-names>F.</given-names> <surname>Dell'Orletta</surname></string-name>
          ,
          <string-name><given-names>F.</given-names> <surname>Tamburini</surname></string-name>
          (Eds.),
          <source>Proceedings of the Seventh Italian Conference on Computational Linguistics, CLiC-it 2020, Bologna, Italy, March 1-3, 2021</source>
          , volume
          <volume>2769</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS.org,
          <year>2020</year>
          . URL: https://ceur-ws.org/Vol-2769/paper_52.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>van Halteren</surname>
          </string-name>
          ,
          <article-title>Linguistic profiling for authorship recognition and verification</article-title>
          ,
          <source>in: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)</source>
          , Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>199</fpage>
          -
          <lpage>206</lpage>
          . URL: https://aclanthology.org/P04-1026. doi:10.3115/1218955.1218981.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cimino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Venturi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          ,
          <article-title>Profiling-UD: a tool for linguistic profiling of texts</article-title>
          , in:
          <string-name><given-names>N.</given-names> <surname>Calzolari</surname></string-name>
          ,
          <string-name><given-names>F.</given-names> <surname>Béchet</surname></string-name>
          ,
          <string-name><given-names>P.</given-names> <surname>Blache</surname></string-name>
          ,
          <string-name><given-names>K.</given-names> <surname>Choukri</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Cieri</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Declerck</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Goggi</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Isahara</surname></string-name>
          ,
          <string-name><given-names>B.</given-names> <surname>Maegaard</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Mariani</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Mazo</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Moreno</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Odijk</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Piperidis</surname></string-name>
          (Eds.),
          <source>Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>7145</fpage>
          -
          <lpage>7151</lpage>
          . URL: https://aclanthology.org/2020.lrec-1.883.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          ,
          <article-title>Intelligible models for classification and regression</article-title>
          ,
          <source>Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          (
          <year>2012</year>
          ). doi:10.1145/2339530.2339556.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>