The Style of a Successful Story: a Computational Study on the Fanfiction Genre

Andrea Mattei (1), Dominique Brunato (2), Felice Dell'Orletta (2)
(1) University of Pisa, a.mattei3@studenti.unipi.it
(2) Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC–CNR), ItaliaNLP Lab - www.italianlp.it, {dominique.brunato, felice.dellorletta}@ilc.cnr.it

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This paper presents a new corpus for the Italian language representative of the fanfiction genre. It comprises about 55k user-generated stories inspired by the original fantasy saga "Harry Potter" and published on a popular website. The corpus is large enough to support data-driven investigations in many directions, from studies on language variation aimed at characterizing this genre with respect to more traditional ones, to emerging topics in computational social science such as the identification of factors involved in the success of a story. The latter is the focus of the presented case study, in which a wide set of multi-level linguistic features has been automatically extracted from a subset of the corpus and analysed in order to detect the ones which significantly discriminate successful from unsuccessful stories.

1 Introduction

Computational Sociolinguistics is an emergent interdisciplinary field aimed at exploiting computational approaches to study the relationship between language and society (Nguyen et al., 2016). One of the primary factors driving its foundation is the widespread diffusion of social media and other user-generated data available online, which has promoted massive research on computer-mediated communication from several perspectives. For instance, scholars working in the field of genre and register variation have relied on quantitative approaches to inspect the peculiarities of social media language, with the purpose of providing a characterization of this new genre with respect to more traditional ones (Paolillo, 2001; Herring and Androutsopoulos, 2015). In the NLP community, the writing style of user-generated data has been analyzed through computational stylometry approaches for addressing tasks broadly related to author profiling (Daelemans, 2013), such as gender and age detection (Peersman et al., 2011; Koppel et al., 2002). The vast majority of this work has considered content published on a few microblogging platforms regarded as most representative of the contemporary user-generated mediascape, e.g. Twitter. More recently, attention has turned to the language used by online communities whose members share a common interest in an object, an activity or, more generally, any area of human interest, allowing scholars to shed light on the growing phenomenon of fandom (Sindoni, 2015). One of the most prominent expressions of fandom is fanfiction (fanfic, fic or FF), i.e. fiction written by fans of a TV series, movie, book etc., using existing characters and situations to develop new plots. In many languages dedicated websites exist where users can publish their own literary works inspired by the original book they are fans of.

From a computational linguistics standpoint, fanfiction has been investigated to infer the relationship between user-generated stories and their original source, e.g. by comparing the representation of characters according to their gender, as well as to model reader reactions to stories (Smitha and Bamman, 2016). Inspired by that study, which was based on a large dataset of stories mainly in English, we collect a new corpus of fanfic stories [1], which, to our knowledge, is the first one for the Italian language. We rely on this corpus to carry out an investigation aimed at shedding light on the possibility of computationally modeling the expected success of a fanfic story, based on the assumptions of linguistic profiling and stylometry research.

[1] Terms of service forbid us to distribute this data. However, the tools used to gather it are available at https://github.com/AndreMatte97/Fanfiction

2 Dataset collection

The corpus comprises texts collected from efpfanfic.net, a portal active since 2001 which allows users to publish stories and to comment on them. The website is made up of two sections: one for original stories and the other for fanfictions. We considered only the latter and we limited the collection to stories based on the fantasy saga by the British writer J.K. Rowling, "Harry Potter". This choice was motivated by the main purpose of our analysis, i.e. characterizing the success of a novel with respect to its writing style rather than as an effect of the various subject matters it deals with. At the same time, the preference given to a very popular book allowed us to keep a consistent number of potential readers and reviewers across the corpus, while still having a large sample of texts to analyze.

The data collection was performed through web scraping, with two spiders written in Python using the open-source Scrapy framework. The first spider crawls the list of stories in the category of choice and extracts their first chapters together with some metadata, including the URLs of the subsequent chapters. The second spider takes these addresses as input and downloads texts and additional information about all the chapters after the first. In the dataset created this way, the record for each chapter includes: ID and Reference ID, combinations used by the website to identify the webpage of each chapter (we use the ID of the first chapter as a reference to group together records belonging to the same story); Title; Rating, an estimate given by the author of how explicit the themes and scenes in the story are; Date of posting; Author's nickname; Number of chapters in the story; Text; Total number of reviews received by the story, divided into positive, negative and neutral; Number of reviews received by the single chapter, as well as the text of the most recent ones.

The crawlers downloaded 54,717 stories, for a total of 197,310 chapters and a mean of approximately 3.6 chapters per story, which is consistent with the mean calculated over every entry on the website. The obtained corpus was divided into folders, each containing stories with the same number of chapters.

3 The success of a fanfiction story: an exploratory study

Based on the newly created dataset, we carried out a computational stylometric analysis aimed at studying whether there is a connection between the success of a fanfic story and its writing style. Such a connection has been demonstrated for more canonical literary works covering novel and movie domains (Ganjigunte et al., 2013; Solorio et al., 2017), showing that stylometry is a viable approach also in scenarios different from authorship attribution and verification.

The methodological framework of our investigation is linguistic profiling (Montemagni, 2013; van Halteren, 2004), an NLP-based approach in which a large set of linguistically motivated features automatically extracted from text is used to obtain a vector-based representation of it. Such representations can then be compared across texts representative of different textual genres and varieties to identify the peculiarities of each. For the purpose of our analysis, we split the original dataset into two varieties corresponding to "successful" and "unsuccessful" stories. To define success we follow an approach similar to that used by Solorio et al. (2017), which is based on the number of reviews obtained by each story. In this regard, we decided to include all reviews, not only the positive ones, which undoubtedly testify to a favorable attitude of the reader towards the story. Two main reasons motivated our choice. First, we noticed that the overwhelming majority of collected reviews are written to convey appreciation, with just 0.73% of a total of nearly 900k reviews being negative; therefore, from a statistical point of view, we can reasonably drop the distinction between kinds of reviews and simply consider the overall amount of feedback received. Second, even negative feedback shows that a given story has been read and has aroused some interest in the reader. With this in mind, we define as "unsuccessful" those stories that did not receive any reviews, thus being largely ignored by their readers. Conversely, the "successful" category includes all stories whose review count is higher than the average over all stories with the same number of chapters. We also decided to limit the focus of this analysis to single-chapter fanfictions written before 2018, so as to avoid the inclusion of stories not yet concluded. The resulting classes comprise 2101 unsuccessful texts and 14486 successful ones, with a threshold for success amounting to 5 reviews. Table 1 shows an example of stories classified in the two categories.

All texts were pre-processed by means of regular expressions, with the aim of removing errors and inconsistencies in the use of punctuation, capitalization and special characters, in order to increase the reliability of automatic linguistic annotation and of feature extraction, which were performed using the Profiling-UD tool (Brunato et al., 2020).

In what follows we first provide an overview of the linguistic features used for our statistical analysis and then discuss the ones that turned out to be more prominent in successful writing.

3.1 Linguistic Features

The set of features is based on the one described in Brunato et al. (2020) and counts more than 150 features, distributed across distinct levels of linguistic annotation and computed according to the Universal Dependencies (UD) annotation framework. These features have been shown to be effective in a variety of scenarios, all related to modeling the 'form' of a text rather than its content: e.g., from the assessment of sentence complexity by humans (Brunato et al., 2018) to the identification of the native language of a speaker from his/her productions in a second language (L2) (Cimino et al., 2018). Specifically, they can be grouped into the following main phenomena:

Raw Text Features: Document length computed as the total number of tokens and of sentences (#Tokens, #Sentences in Table 2); average sentence length and token length, calculated in tokens and in characters, respectively (Sent length, Word length).

Lexical Richness: Distribution of words and lemmas belonging to the Basic Italian Vocabulary (De Mauro, 2000) (BIV Tok, BIV Types) and to its internal repertories (i.e. fundamental, high usage and high availability: BIV Fund; BIV High-US; BIV High-AV); Type/Token Ratio, a feature of lexical variety computed as the ratio between the number of lexical types and the number of tokens in the first 100 and 200 words of text (TTR Lemma); Lexical density.

Morpho-Syntactic Information: Distribution of all grammatical categories, with respect to the Universal part-of-speech tagset (UPOS *) and the language-specific tagset (XPOS *); distribution of verbs according to tense, mood and person, both for main and auxiliary verbs (aux *; V *).

Verbal Predicate Structure: Average distribution of verbal roots and of verbal heads per sentence (VerbHead); features related to the arity of verbs (i.e. average number of dependents per verbal head, distribution of verbs by arity).

Global and Local Parsed Tree: Average depth of the syntactic tree (MaxDepth); average depth of embedded complement chains headed by a preposition; average length of dependency links and of the maximum link (Links Len; Max Link Length); relative order of the subject and object with respect to the verb.

Syntactic relations: Distribution of typed UD dependency relations (dep *).

Use of Subordination: Distribution of main and subordinate clauses (Main clause, Subord clause), average length of subordinate chains, distribution of subordinate chains by length.

4 Data Analysis

For each considered feature we calculated the average value and the standard deviation in the two classes. We then assessed whether the variation between mean values is significant using the Wilcoxon rank sum test. We found that 57% of the features (i.e. 126 out of 219) are distributed in a significantly different way between successful and unsuccessful stories. In Table 2 we report an extract of the most interesting ones.

As can be seen, successful stories are on average longer in terms of number of tokens and sentences (1, 2), although these sentences are generally shorter (3), suggesting that readers prefer a plainer writing style. However, when lexical factors are considered, the preference is given to texts exhibiting less frequent words, as suggested by the slightly lower distribution of words belonging to the Basic Italian Vocabulary (5, 6) and especially to the Fundamental one (7). Inflectional morphology also appears as a domain of variation between the two classes. Successful fanfictions use second-person verbs more often (15), a feature typical of narrative writing related to direct speech. On the contrary, we observe a higher distribution of third-person verbs, specifically auxiliaries, both singular (14) and plural (13), in less successful texts, which can hint at a preference for reported speech.
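The per-feature comparison described in Section 4 (mean and standard deviation in each class, then a Wilcoxon rank sum test) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the toy values and the default significance level are assumptions, and the p-value relies on the tie-corrected normal approximation, which may differ slightly from what a statistics package reports.

```python
import math
from statistics import mean, stdev


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation, tie-corrected).

    Returns (z, p). Assumes len(x) + len(y) >= 2.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    w = sum(ranks[:n1])               # rank sum of the first sample
    expected = n1 * (n + 1) / 2       # E[W] under the null hypothesis
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values()) / (n * (n - 1))
    variance = n1 * n2 / 12 * ((n + 1) - tie_term)
    if variance == 0:                 # all pooled values identical
        return 0.0, 1.0
    z = (w - expected) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


def compare_feature(unsuccessful, successful, alpha=0.05):
    """Summarise one feature in both classes and test the difference.

    alpha is an illustrative default, not a value prescribed by the paper.
    """
    z, p = rank_sum_test(unsuccessful, successful)
    return {
        "unsucc_mean": mean(unsuccessful),
        "unsucc_std": stdev(unsuccessful),
        "succ_mean": mean(successful),
        "succ_std": stdev(successful),
        "z": z,
        "p": p,
        "significant": p < alpha,
    }


if __name__ == "__main__":
    # Hypothetical sentence-length values, not taken from the corpus.
    unsucc = [22.1, 19.4, 25.3, 18.7, 30.2, 21.5]
    succ = [16.9, 17.8, 15.2, 18.1, 16.4, 17.0]
    print(compare_feature(unsucc, succ))
```

Running `compare_feature` once per feature column and counting the rows flagged as significant mirrors the kind of tally reported above (126 out of 219 features), under the stated assumptions.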
Label: Successful
Example (Italian): La città di Edimburgo era sommersa da una cascata d'acqua. Pioveva. Pioveva da giorni e giorni, senza sosta. Il cielo era illuminato di lampi e scosso da tuoni. Le strade erano vuote. Per la prima volta da giorni, allo scoccare della mezzanotte, la pioggia cessò di colpo. Il silenzio piombò sui quartieri che sembrarono improvvisamente più bui. E in quel silenzio penetrante, l'unico rumore che si riusciva a distinguere era un tac-tac-tac leggero e discontinuo. Proveniva da una finestra. La finestra di una lussuosa casa in centro, l'unica luce accesa a quell'ora. Joanne era davanti al computer, fonte di quel tremolio e scriveva. Batteva le dita sulla tastiera per alcuni istanti, poi si fermava, rileggeva, cancellava e riscriveva. Andava avanti così da giorni. I suoi occhi erano stanchi, ma la sua mente lavorava frenetica. Mancava poco [2].
Example (English): The city of Edinburgh was flooded by a cascade of water. It was raining. It had been raining for days and days, relentlessly. The sky was lit by lightning and shaken by thunder. The streets were empty. For the first time in days, at the stroke of midnight, the rain stopped abruptly. Silence fell upon the districts that suddenly seemed darker. And in that piercing silence, the only noise that could be recognized was a faint and irregular tac-tac-tac. It was coming from a window. The window of a luxurious house in the city centre, the only light still on at that time. Joanne was in front of the computer, source of that trembling and was writing. She tapped her fingers on the keyboard for a few moments, then stopped, reread, deleted and rewrote. She had been going on like this for days. Her eyes were tired, but her mind was working frantically. Almost there.

Label: Unsuccessful
Example (Italian): Il cielo era tetro cosparso di nuvole che sembravano volere annunciare un acquazzone, il vento ulula forte facendo sbattere le finestre violentemente, come se volesse gridare, liberarsi da una rabbia repressa. La donna dai lunghi capelli rosso scuro continuava a fissare la devastazione attraverso il vetro che ora si era appannato dal suo stesso respiro. Aveva lo sguardo malinconico non più illuminato da quella dolce espressione che il riso le donava. Una mano le si poggiò sulla spalla e girò pian piano il volto verso la persona amata che con un ritmo lento cominciò ad accarezzarle le gote che assunsero un colorito roseo alla sua pelle pallida. Chiuse gli occhi come per assaporare quel dolce tocco che ora si era spostato nei suoi capelli. "Non guardare più oltre il vetro" Mormorò la voce con una nota di preoccupazione, apparteneva a James, marito di Lily la donna dai lunghi capelli rossi [3].
Example (English): The sky was bleak strewn with clouds that seemed to want to announce a downpour, the wind howls loudly making the windows slam violently, as if it wanted to scream, to free itself from a suppressed anger. The woman with the long dark red hair kept staring at the devastation through the glass that was now clouded by her own breath. Her melancholic gaze was no longer lit up by that sweet look that laughter gave her. A hand rested on her shoulder and slowly turned her face towards the loved one who started slowly caressing her cheeks which took on a rosy tone on her pale skin. She closed her eyes, as if to savor that sweet touch that had now moved into her hair. "Don't look beyond the glass anymore" Whispered the voice with a note of concern, it belonged to James, husband of Lily the woman with long red hair.

Table 1: An extract of a 'successful' story (the most reviewed one) and of an 'unsuccessful' one.

[2] The full story can be found at https://efpfanfic.net/viewstory.php?sid=607026&i=1
[3] The full story can be found at https://efpfanfic.net/viewstory.php?sid=27412&i=1

Focusing on the distribution of morpho-syntactic categories, there is a significant difference in the usage of the most common punctuation marks, commas (25) and full stops (26), which are more frequent in highly reviewed fanfictions. These features are related to the previously observed difference in document length, as texts with more sentences necessarily use punctuation marks to divide them. Additionally, balanced marks (24), i.e. parentheses and quotation marks, occur more in successful texts, strengthening our previous claim about a more frequent presence of direct speech in this class. At the syntactic level, dependency relations are slightly shorter in successful texts, considering both the average length of all dependencies (29) and the length of the maximum dependency link (30). In readability assessment studies, longer syntactic dependencies are typically found in complex texts, and the same holds for deeper syntactic trees (31). Both these features have lower values in highly reviewed stories, suggesting that the style of successful writing is characterized by a simpler syntactic structure. Interestingly, these results, although preliminary, go in the opposite direction to those reported by Ganjigunte et al. (2013) for successful literary works in English, which were found to be less correlated with text readability scores. Finally, subordinate clauses (33) occur slightly more often than main clauses (32) in unsuccessful texts, while there is a nearly even split between hypotaxis and parataxis in successful ones.

Feature                   Unsucc Avg (StDev)   Success Avg (StDev)
Raw Text Features
 1. # Tokens              1401 (1940)          2120 (2718)
 2. # Sentences           78.4 (116.7)         125.1 (153.6)
 3. Sent length           20.18 (12.39)        17.38 (6.43)
 4. Word length           4.50 (.250)          4.52 (.193)
Lexical Features
 5. % BIV Tok             85.7 (5.1)           84.8 (4.7)
 6. % BIV Types           73.4 (7)             70.1 (7)
 7. % BIV Fund            61 (7.5)             57.1 (7.7)
 8. * % BIV High-AV       3.1 (1)              3.1 (1)
 9. % BIV High-US         8.5 (2.4)            9.1 (2.5)
10. Lexical density       .498 (.033)          .503 (.031)
11. TTR Lemma 100         .560 (.118)          .560 (.112)
12. TTR Lemma 200         .433 (.114)          .436 (.110)
Morpho-Syntactic Features
13. % Aux 3perPl          13.2 (9)             11.8 (7.6)
14. % Aux 3perSin         54.4 (17.1)          53.2 (15.6)
15. % Aux 2perSin         6.3 (8.1)            7.9 (8.5)
16. % Aux Imperf.         38.5 (24.8)          31.3 (23.6)
17. % Aux Pres.           52.4 (26)            60 (23.6)
18. % V Gerund            5.7 (3.8)            6.3 (4)
19. % upos VERB           12.5 (1.8)           12.3 (1.7)
20. % upos NOUN           13.8 (2.3)           13.5 (2.1)
21. % upos ADJ            4.7 (1.4)            4.6 (1.2)
22. % upos PRON           8.59 (2.24)          8.51 (2)
23. % upos ADP            10.8 (1.9)           10.4 (1.8)
24. % xpos FB             1.7 (2)              2.3 (2.4)
25. % xpos FF             6.5 (2.7)            7.1 (2.8)
26. % xpos FS             5.5 (2.2)            6.1 (2.1)
27. % xpos CC             3.1 (.9)             2.9 (.8)
28. % xpos CS             1.7 (.7)             1.8 (.7)
Syntactic Features
29. Links Len             2.78 (.438)          2.72 (.385)
30. * Max Link Len        1.19 (2.38)          .687 (1.33)
31. MaxDepth              3.96 (1.45)          3.58 (.857)
32. % Main clause         48.8 (9.9)           49.9 (9)
33. % Subord clause       51.2 (9.9)           50.1 (9)
34. % Verb Head           2.63 (1.72)          2.26 (.897)
35. % dep nsubj           4.9 (1.1)            4.7 (1)
36. * % dep obj           5.3 (1.1)            5.3 (1)
37. % dep obl             5.5 (1.1)            5.2 (1)
38. % dep punct           14.2 (3.8)           16 (4.1)
39. % dep conj            4 (1.3)              3.7 (1.1)
40. % dep det             10.9 (2)             10.5 (1.8)

Table 2: An extract of linguistic features varying significantly between successful and unsuccessful stories. All differences are significant at p < 0.001, except for features marked with an asterisk, which have p < 0.05.

To deepen our analysis, we also computed the coefficient of variation σ* for all features varying significantly between the two classes, where σ* is the ratio between the standard deviation σ and the mean µ. This allowed us to evaluate the dispersion of values around the average in a standardized way, and thus to compare the stability of features measured on different scales. A feature that is widely scattered in one class of texts and highly stable in the other has a greater chance of being a meaningful representative of the latter.

In Figure 1 we show the average variability in the two classes of the four groups of features, distinguished according to the level of annotation they were extracted from. As a whole, we noticed that successful texts display less variability in nearly every considered feature: 117 of them (92%) are more stable in this class. In successful stories, features with greater stability compared to the other class are mainly raw text ones, e.g. number of sentences and number of tokens, and syntactic ones, e.g. verbal heads per sentence and average depth of syntactic trees. Among the few features which are more stable in poorly received texts, we find instead verbal predicate features, such as the distributions of past tenses and of indicative moods, in addition to the frequency of usage of cardinal numbers. The set of lexical features is instead the most stable one for both classes.

Figure 1: Average coefficient of variation in each class of features, both for successful and unsuccessful texts.

5 Conclusion

In this paper, we presented an NLP-based stylometric analysis of the emerging genre of fanfiction aimed at characterizing the writing style of a successful story. We collected a new large-scale corpus which, to the best of our knowledge, is the first one of this genre for Italian. We showed that successful stories, defined as those receiving a number of reviews higher than the average, are characterized by a variety of linguistic features at different levels of granularity and that these features are more uniformly distributed within them.

In the future, we would like to broaden the perspective to other genres in order to study whether there are linguistic predictors of successful writing which are constant across different genres, as well as across concepts somewhat similar to success, such as virality and engagement.

References

D. Brunato, A. Cimino, F. Dell'Orletta, G. Venturi and S. Montemagni. 2020. Profiling-UD: a Tool for Linguistic Profiling of Texts. Proceedings of The 12th Language Resources and Evaluation Conference, European Language Resources Association, 7145–7151.

D. Brunato, L. De Mattei, F. Dell'Orletta, B. Iavarone and G. Venturi. 2018. Is this Sentence Difficult? Do you Agree? Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP 2018).

A. Cimino, F. Dell'Orletta, D. Brunato and G. Venturi. 2018. Sentences and Documents in Native Language Identification. Proceedings of 5th Italian Conference on Computational Linguistics (CLiC-IT), 1–6, Turin.

W. Daelemans. 2013. Explanation in Computational Stylometry. Gelbukh A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013, Lecture Notes in Computer Science, vol 7817. Springer, Berlin, Heidelberg.

Tullio De Mauro. 2000. Grande dizionario italiano dell'uso (GRADIT). Torino, UTET.

V. Ganjigunte Ashok, S. Feng and Yejin Choi. 2013. Success with style: Using writing style to predict the success of novels. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1753–1764.

S.C. Herring and J. Androutsopoulos. 2015. Computer-mediated discourse 2.0. The handbook of discourse, 2nd ed. Deborah Tannen, Heidi E. Hamilton, Deborah Schiffrin, eds. John Wiley & Sons, 1753–1764.

M. Koppel, S. Argamon and A. Rachel Shimoni. 2002. Automatically Categorizing Written Texts by Author Gender. Lit. Linguistic Comput., 17, 4, 401–412.

S. Montemagni. 2013. Tecnologie linguistico-computazionali e monitoraggio della lingua italiana. Studi Italiani di Linguistica Teorica e Applicata (SILTA), 145–172.

D. Nguyen, A.S. Doğruöz, C.P. Rosé, and F.M.G. de Jong. 2016. Computational Sociolinguistics: A Survey. Computational Linguistics, Vol. 42, No. 3, 537–593.

John Paolillo. 2001. Language variation on Internet Relay Chat: A social network approach. Journal of Sociolinguistics, 5, 180–213.

C. Peersman, W. Daelemans, and L. Van Vaerenbergh. 2011. Predicting Age and Gender in Online Social Networks. Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, 37–44.

M.G. Sindoni. 2015. 'I Really Have No Idea What Non-Fandom People Do with Their Lives.' A Multimodal and Corpus-Based Analysis of Fanfiction. Lingue e Linguaggi, (13), 277–300, doi.org/10.1285/i22390359v13p277.

M. Smitha and D. Bamman. 2016. Beyond Canonical Texts: A Computational Analysis of Fanfiction. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016.

T. Solorio, M. Montes-y-Gómez, Suraj Maharjan, J. Ovalle and Fabio A. González. 2017. Multi-task Approach to Predict Likability of Books. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 1217–1227.

H. van Halteren. 2004. Linguistic profiling for author recognition and verification. Proceedings of the Association for Computational Linguistics, 200–207.
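As a supplementary illustration of the coefficient of variation used in Section 4 (σ* = σ / µ), the sketch below recomputes it from a few (mean, standard deviation) pairs reported in Table 2. The function and variable names are hypothetical and the selection of rows is arbitrary; the paper computes σ* over all significantly varying features, not only these three.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Return sigma* = sigma / mu for one feature in one class."""
    if mean_value == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return std_dev / abs(mean_value)


# (mean, standard deviation) pairs copied from Table 2,
# ordered as (unsuccessful, successful).
TABLE2_EXTRACT = {
    "# Tokens": ((1401, 1940), (2120, 2718)),
    "Sent length": ((20.18, 12.39), (17.38, 6.43)),
    "MaxDepth": ((3.96, 1.45), (3.58, 0.857)),
}


def more_stable_class(unsucc, succ):
    """Name the class in which the feature has the lower sigma*."""
    cv_unsucc = coefficient_of_variation(unsucc[1], unsucc[0])
    cv_succ = coefficient_of_variation(succ[1], succ[0])
    if cv_unsucc == cv_succ:
        return "tie"
    return "successful" if cv_succ < cv_unsucc else "unsuccessful"


for name, (unsucc, succ) in TABLE2_EXTRACT.items():
    cv_u = coefficient_of_variation(unsucc[1], unsucc[0])
    cv_s = coefficient_of_variation(succ[1], succ[0])
    print(f"{name}: unsucc={cv_u:.3f} succ={cv_s:.3f} "
          f"more stable in: {more_stable_class(unsucc, succ)}")
```

For these three rows the ratio is lower in the successful class, which agrees with the observation that raw text and syntactic features tend to be more stable in successful stories.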