The Style of a Successful Story: a Computational Study on the Fanfiction Genre

Andrea Mattei (a.mattei3@studenti.unipi.it), Dominique Brunato, Felice Dell'Orletta (felice.dellorletta@ilc.cnr.it)
University of Pisa; Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC-CNR), ItaliaNLP Lab

This paper presents a new corpus for the Italian language representative of the fanfiction genre. It comprises about 55k user-generated stories inspired by the original fantasy saga "Harry Potter" and published on a popular website. The corpus is large enough to support data-driven investigations in many directions, from more traditional studies on language variation aimed at characterizing this genre with respect to more traditional ones, to emerging topics in computational social science such as the identification of factors involved in the success of a story. The latter is the focus of the presented case study, in which a wide set of multi-level linguistic features has been automatically extracted from a subset of the corpus and analysed in order to detect the ones which significantly discriminate successful from unsuccessful stories.

1 Introduction

Computational Sociolinguistics is an emergent interdisciplinary field aimed at exploiting computational approaches to study the relationship between language and society (Nguyen et al., 2016). One of the primary factors driving its foundation is the widespread diffusion of social media and other user-generated data available online, which has promoted massive research on computer-mediated communication from several perspectives. For instance, scholars working in the field of genre and register variation have relied on quantitative approaches to inspect the peculiarities of social media language, with the purpose of providing a characterization of this new genre with respect to more traditional ones (Paolillo, 2001; Herring and Androutsopoulos, 2015). In the NLP community, the writing style of user-generated data has been analyzed through computational stylometry approaches for addressing tasks broadly related to author profiling (Daelemans, 2013), such as gender and age detection (Peersman et al., 2011; Koppel et al., 2002). The vast majority of this work has taken into account content published on a few microblogging platforms considered more representative of the contemporary user-generated mediascape, e.g. Twitter.

More recently, attention has turned to the language used by online communities whose members share a common interest in an object, an activity or, more generally, any area of human interest, allowing scholars to shed light on the growing phenomenon of fandom (Sindoni, 2015). One of the most prominent expressions of fandom is fanfiction (fanfic, fic or FF), i.e. fiction written by fans of a TV series, movie, book etc., using existing characters and situations to develop new plots. In many languages dedicated websites exist where users can publish their own literary works inspired by the original book they are fans of.

From a computational linguistics standpoint, fanfiction has been investigated to infer the relationship between user-generated stories and their original source, e.g. by comparing the representation of characters according to their gender, as well as to model reader reactions to stories (Smitha and Bamman, 2016). Inspired by that study, which was based on a large dataset of stories mainly in English, we collected a new corpus of fanfic stories (see footnote 1), which, to our knowledge, is the first one for the Italian language. We rely on this corpus to investigate whether the expected success of a fanfic story can be modeled computationally, based on the assumptions of linguistic profiling and stylometry research.

2 Dataset collection

The corpus comprises texts collected from efpfanfic.net, a portal active since 2001 which allows users to publish stories and to comment on them. The website is made up of two sections: one for original stories and the other for fanfictions. We considered only the latter and limited the collection to stories based on "Harry Potter", the fantasy saga by the British writer J.K. Rowling. This choice was motivated by the main purpose of our analysis, i.e. characterizing the success of a story with respect to its writing style rather than as an effect of the various subject matters it deals with. At the same time, choosing a very popular book allowed us to keep the number of potential readers and reviewers consistent across the corpus, while still having a large sample of texts to analyze.

The data collection was performed through web scraping, with two spiders written in Python using the open-source Scrapy framework. The first spider crawls the list of stories in the category of choice and extracts their first chapters together with some metadata, including the URLs of the subsequent chapters. The second spider takes these addresses as input and downloads texts and additional information about all the chapters after the first. In the dataset created this way, the record for each chapter includes:

ID and Reference ID: combinations used by the website to identify the webpage of each chapter. We use the ID of the first chapter as a reference to group together records belonging to the same story;

Title;

Rating: an estimate given by the author of the rawness of the themes and scenes contained in the story;

Date of posting;

Author's nickname;

Number of chapters in the story;

Text;

Total number of reviews received by the story, divided into positive, negative and neutral;

Number of reviews received by the single chapter, as well as the text of the most recent ones.

The crawlers downloaded 54,717 stories, for a total of 197,310 chapters and a mean of approximately 3.6 chapters per story, which is consistent with the mean calculated over every entry on the website. The obtained corpus was divided into folders, each containing stories with the same number of chapters.
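The grouping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`id`, `reference_id`) and the toy records are hypothetical and do not reproduce the authors' actual schema or crawler output.

```python
from collections import defaultdict

# Hypothetical chapter records; field names are illustrative assumptions.
chapters = [
    {"id": 101, "reference_id": 101, "title": "Story A", "text": "..."},
    {"id": 102, "reference_id": 101, "title": "Story A", "text": "..."},
    {"id": 205, "reference_id": 205, "title": "Story B", "text": "..."},
]


def group_by_story(records):
    """Group chapter records under the ID of their story's first chapter."""
    stories = defaultdict(list)
    for record in records:
        stories[record["reference_id"]].append(record)
    return dict(stories)


def mean_chapters_per_story(stories):
    """Average number of chapters across all stories."""
    return sum(len(chs) for chs in stories.values()) / len(stories)


stories = group_by_story(chapters)
mean_len = mean_chapters_per_story(stories)

# The corpus-level figure reported in the text follows from the same ratio.
corpus_mean = 197310 / 54717
```

With the counts reported in the text, the same ratio gives roughly 3.6 chapters per story.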

3 The success of a fanfiction story: an exploratory study

Based on the newly created dataset, we carried out a computational stylometric analysis aimed at studying whether there is a connection between the success of a fanfic story and its writing style. Such a connection has been demonstrated for more canonical literary works covering novel and movie domains (Ganjigunte et al., 2013;Solorio et al., 2017), showing that stylometry is a viable approach also in scenarios different from authorship attribution and verification.

The methodological framework of our investigation is linguistic profiling (Montemagni, 2013; van Halteren, 2004), an NLP-based approach in which a large set of linguistically motivated features automatically extracted from a text are used to obtain a vector-based representation of it. Such representations can then be compared across texts representative of different textual genres and varieties to identify the peculiarities of each. For the purpose of our analysis, we split the original dataset into two varieties corresponding to "successful" and "unsuccessful" stories. To define success we follow an approach similar to that used by Solorio et al. (2017), which is based on the number of reviews obtained by each story. In this regard, we decided to include all reviews, not only the positive ones, which undoubtedly testify to a favorable attitude of the reader towards the story. Two main reasons motivated our choice. First, we noticed that the overwhelming majority of collected reviews are written to convey appreciation, with just 0.73% of a total of nearly 900k reviews being negative; therefore, from a statistical point of view, we can reasonably drop the distinction between kinds of reviews and simply consider the overall amount of feedback received. Second, even negative feedback shows that a given story has been read and has aroused some interest in the reader.

With this in mind, we define as "unsuccessful" those stories that did not receive any reviews, thus being largely ignored by their readers. Conversely, the "successful" category includes, among stories with the same number of chapters, all those that received a review count higher than the average of all stories of that length. We also decided to limit the focus of this analysis to single-chapter fanfictions written before 2018, so as to avoid the inclusion of stories not yet concluded. The resulting classes comprise 2101 unsuccessful texts and 14486 successful ones, with a threshold for success amounting to 5 reviews. Table 1 shows an example of stories classified in the two categories.
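The labeling rule can be illustrated with a minimal Python sketch. The records below are invented, and the strict "greater than the average" comparison is our reading of the definition above, not a confirmed detail of the authors' code; stories that fall in neither class are left unlabeled.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (story_id, number_of_chapters, total_reviews).
stories = [
    ("s1", 1, 0), ("s2", 1, 2), ("s3", 1, 12), ("s4", 1, 6),
    ("s5", 3, 0), ("s6", 3, 40),
]


def label_stories(records):
    """Label each story relative to the mean review count of its length group."""
    reviews_by_length = defaultdict(list)
    for _, n_chapters, n_reviews in records:
        reviews_by_length[n_chapters].append(n_reviews)
    average = {n: mean(counts) for n, counts in reviews_by_length.items()}

    labels = {}
    for story_id, n_chapters, n_reviews in records:
        if n_reviews == 0:
            labels[story_id] = "unsuccessful"   # no reviews at all
        elif n_reviews > average[n_chapters]:
            labels[story_id] = "successful"     # above the group average
        else:
            labels[story_id] = None             # excluded from both classes
    return labels


labels = label_stories(stories)
```

In the toy data the single-chapter group has a mean of 5 reviews, so `s3` and `s4` are labeled successful, `s1` unsuccessful, and `s2` is left out of the comparison.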

All texts were pre-processed by means of regular expressions, with the aim of removing errors and inconsistencies in the use of punctuation, capitalization and special characters, in order to increase the reliability of automatic linguistic annotation and the process of feature extraction, which were performed using the Profiling-UD tool (Brunato et al., 2020).
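The paper does not list the regular expressions that were applied, so the following Python sketch only shows the kind of normalization such a step might involve. Every rule in it is an assumption chosen for illustration, not the authors' actual pre-processing.

```python
import re


def normalize(text):
    """Apply a few illustrative cleanup rules (all of them assumptions)."""
    # Map curly quotes and apostrophes to their plain ASCII counterparts.
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    text = text.replace("\u2019", "'")
    # Cap runs of four or more dots at a three-dot ellipsis.
    text = re.sub(r"\.{4,}", "...", text)
    # Collapse a repeated exclamation or question mark into a single one.
    text = re.sub(r"([!?])\1+", r"\1", text)
    # Remove whitespace before punctuation marks.
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Insert a missing space after punctuation that is followed by a letter.
    text = re.sub(r"([,;:!?])(?=[^\s\d\"'])", r"\1 ", text)
    # Collapse any remaining runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


cleaned = normalize("Ciao ,come   stai ??")
```

Rules like these mainly help the sentence splitter and tokenizer, which is why they would be applied before the automatic annotation.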

In what follows we first provide an overview of the linguistic features used for our statistical analysis and then we discuss the ones that turned out to be more prominent in successful writing.

3.1 Linguistic Features

The set of features is based on the one described in Brunato et al. (2020) and counts more than 150 features, distributed across distinct levels of linguistic annotation and computed according to the Universal Dependencies (UD) annotation framework. These features have been shown to be effective in a variety of scenarios, all related to modeling the 'form' of a text rather than its content: e.g., from the assessment of sentence complexity by humans (Brunato et al., 2018) to the identification of the native language of a speaker from their productions in a second language (L2) (Cimino et al., 2018). Specifically, they can be grouped into the following main phenomena:

Raw Text Features: Document length computed as the total number of tokens and of sentences (#Tokens, #Sentences in Table 2); average sentence length and token length, calculated in tokens and in characters, respectively (Sent length, Word length).

Lexical Richness: Distribution of words and lemmas belonging to the Basic Italian Vocabulary (De Mauro, 2000) (BIV Tok, BIV Types) and to its internal repertoires (i.e. fundamental, high usage and high availability; BIV Fund, BIV High-US, BIV High-AV); Type/Token Ratio, a measure of lexical variety computed as the ratio between the number of lexical types and the number of tokens in the first 100 and 200 words of a text (TTR Lemma); Lexical density.

Morpho-Syntactic Information: Distribution of all grammatical categories, with respect to the Universal part-of-speech tagset (UPOS *) and the language-specific tagset (XPOS *); Distribution of verbs according to tense, mood and person, both for main and auxiliary verbs (aux *; V *).

Verbal Predicate Structure: Average distribution of verbal roots and of verbal heads for sentences (VerbHead); features related to the arity of verbs (i.e. average number of dependents for verbal head, distribution of verbs by arity).

Global and Local Parsed Tree: Average depth of the syntactic tree (MaxDepth); average depth of embedded complement chains headed by a preposition; average length of dependency links and of the maximum link (Links Len; Max Link Length); relative order of the subject and object with respect to the verb;

Syntactic relations: Distribution of typed UD dependency relations (dep *);

Use of Subordination: Distribution of main and subordinate clauses (Main clause, Subord clause), average length of subordinate chains, distribution of subordinate chains by length.
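Two of the features listed above can be approximated with a short Python sketch. It is not the Profiling-UD implementation: the tokenization, the lowercasing, and the handling of the root are simplifying assumptions, and the `(token_id, head_id)` pairs stand in for the ID and HEAD columns of a UD parse.

```python
def type_token_ratio(tokens, window):
    """Ratio of distinct types to tokens within the first `window` tokens."""
    chunk = [token.lower() for token in tokens[:window]]
    return len(set(chunk)) / len(chunk) if chunk else 0.0


def average_link_length(sentence):
    """Mean distance in tokens between each word and its syntactic head.

    `sentence` is a list of (token_id, head_id) pairs with 1-based IDs;
    a head_id of 0 marks the root, which has no incoming link to measure.
    """
    lengths = [abs(tid - head) for tid, head in sentence if head != 0]
    return sum(lengths) / len(lengths) if lengths else 0.0


tokens = "il gatto e il cane".split()
ttr = type_token_ratio(tokens, window=100)

# Toy parse: token 3 is the root, the other tokens attach to it.
parse = [(1, 3), (2, 3), (3, 0), (4, 3)]
avg_link = average_link_length(parse)
```

Fixing the window at the first 100 or 200 tokens, as in the feature description, keeps the ratio comparable across texts of different length, since the type/token ratio tends to decrease as a text grows longer.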

3.2 Data Analysis

For each considered feature we calculated the average value and the standard deviation in the two classes. We then assessed whether the difference between the two classes is significant using the Wilcoxon rank sum test. We found that 57% of the features (i.e. 126 out of 219) are distributed in a significantly different way between successful and unsuccessful stories. In Table 2 we report an extract of the most interesting ones.
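As a rough illustration of this test, the sketch below implements a two-sided Wilcoxon rank sum test in plain Python using the normal approximation, without tie or continuity correction. The paper does not state which implementation or settings were used, and the feature values in the example are invented.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank sum test (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first sample
    expected = n1 * (n1 + n2 + 1) / 2        # mean of W under the null
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - expected) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented values of one feature in the two classes.
successful = [1, 2, 3, 4, 5]
unsuccessful = [6, 7, 8, 9, 10]
z, p = rank_sum_test(successful, unsuccessful)
```

Repeating such a test once per feature yields the set of features whose distribution differs between the classes; a rank-based test is a natural choice here because many frequency features are unlikely to be normally distributed.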

As can be seen, successful stories are on average longer in terms of number of tokens and sentences (1, 2), although these sentences are generally shorter (3), suggesting that readers prefer a plainer writing style. However, when lexical factors are considered, the preference is given to texts exhibiting less frequent words, as suggested by the slightly lower distribution of words belonging to the Basic Italian Vocabulary (5, 6) and especially to the Fundamental one (7). Inflectional morphology also appears as a domain of variation between the two classes. Successful fanfictions employ verbs in the second person considerably more often (15), a feature typical of narrative writing related to direct speech. By contrast, we observe a higher distribution of third person verbs, specifically auxiliaries, both singular (14) and plural (13), in less successful texts, which may hint at a preference for reported speech.

Successful (English translation): For the first time in days, at the stroke of midnight, the rain stopped abruptly. Silence fell upon the districts that suddenly seemed darker. And in that piercing silence, the only noise that could be recognized was a faint and irregular tac-tac-tac. It was coming from a window. The window of a luxurious house in the city centre, the only light still on at that time. Joanne was in front of the computer, source of that trembling and was writing. She tapped her fingers on the keyboard for a few moments, then stopped, reread, deleted and rewrote. She had been going on like this for days. Her eyes were tired, but her mind was working frantically. Almost there.

Unsuccessful (Italian original, see footnote 3): Il cielo era tetro cosparso di nuvole che sembravano volere annunciare un acquazzone, il vento ulula forte facendo sbattere le finestre violentemente, come se volesse gridare, liberarsi da una rabbia repressa. La donna dai lunghi capelli rosso scuro continuava a fissare la devastazione attraverso il vetro che ora si era appannato dal suo stesso respiro. Aveva lo sguardo malinconico non più illuminato da quella dolce espressione che il riso le donava. Una mano le si poggiò sulla spalla e girò pian piano il volto verso la persona amata che con un ritmo lento cominciò ad accarezzarle le gote che assunsero un colorito roseo alla sua pelle pallida. Chiuse gli occhi come per assaporare quel dolce tocco che ora si era spostato nei suoi capelli. "Non guardare più oltre il vetro" Mormorò la voce con una nota di preoccupazione, apparteneva a James, marito di Lily la donna dai lunghi capelli rossi.

Unsuccessful (English translation): The sky was bleak strewn with clouds that seemed to want to announce a downpour, the wind howls loudly making the windows slam violently, as if it wanted to scream, to free itself from a suppressed anger. The woman with the long dark red hair kept staring the devastation through the glass that was now clouded by her own breath. Her melancholic gaze was no longer lit up by that sweet look that laughter gave her. A hand rested on her shoulder and slowly turned her face towards the loved one who started slowly caressing her cheeks which took on a rosy tone on her pale skin. She closed her eyes, as if to savor that sweet touch that had now moved into her hair. "Don't look beyond the glass anymore" Whispered the voice with a note of concern, it belonged to James, husband of Lily the woman with long red hair.

Table 1: An extract of a 'successful' story (the most reviewed one) and of an 'unsuccessful' one.

Focusing on the distribution of morphosyntactic categories, there is a significant difference in the usage of the most common punctuation marks, commas (25) and full stops (26), which are considerably more frequent in highly-reviewed fanfictions. These features are related to the previously observed difference in document length, as texts with more sentences necessarily use more punctuation marks to divide them. Additionally, balanced marks (24), i.e. parentheses and quotation marks, occur more often in successful texts, strengthening our previous claim about a more frequent presence of direct speech in this class. At the syntactic level, dependency relations are slightly shorter in successful texts, considering both the average length of all dependencies (29) and the length of the maximum dependency link (30). In readability assessment studies, longer syntactic dependencies are typically found in complex texts, and the same holds for deeper syntactic trees. Both these features have lower values in highly-reviewed stories, suggesting that the style of successful writing is characterized by a simpler syntactic structure. Interestingly, these results, although preliminary, go in the opposite direction to those reported by Ganjigunte et al. (2013) for successful literary works in English, which were found to be less correlated with text readability scores. Finally, subordinate clauses (33) occur slightly more often than main clauses (32) in unsuccessful texts, while there is a nearly even split between hypotaxis and parataxis in successful ones.

Table 2: An extract of linguistic features varying significantly between successful and unsuccessful stories. All differences are significant at p < 0.001, except for features marked with an asterisk, which have p < 0.05.

To deepen our analysis, we also computed the coefficient of variation σ* for all features varying significantly between the two classes, where σ* is the ratio between the standard deviation σ and the mean µ. This allowed us to evaluate the dispersion of values around the average in a standardized way, and thus to compare the stability of features measured on different scales. A feature that is widely scattered in one class of texts and highly stable in the other has a greater chance of being a meaningful representative of the latter.
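The definition σ* = σ / µ translates directly into code. The Python sketch below uses the population standard deviation, which is an assumption since the paper does not say whether the population or the sample estimate was used; the feature values are invented.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Ratio between the (population) standard deviation and the mean."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(values) / mu


# Invented values of the same feature on two different scales.
sentence_lengths = [2, 4, 4, 4, 5, 5, 7, 9]
token_counts = [v * 1000 for v in sentence_lengths]

cv_small = coefficient_of_variation(sentence_lengths)
cv_large = coefficient_of_variation(token_counts)
```

Both lists yield the same coefficient despite differing by a factor of 1000, which is the property that makes features measured on different scales comparable.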

In Figure 1 we show the average variability in the two classes of the four groups of features, distinguished according to the level of annotation they were extracted from. As a whole, we noticed that successful texts display less variability in nearly every considered feature: 117 of them (92%) are more stable in this class. In successful stories, features with greater stability compared to the other class are mainly raw text features (e.g. number of sentences and number of tokens) and syntactic ones (e.g. verbal heads per sentence and average depth of syntactic trees). Among the few features which are more stable in poorly received texts, we find instead verbal predicate features, such as the distributions of past tenses and of indicative moods, in addition to the frequency of cardinal numbers. The set of lexical features is instead the most stable one for both classes.

4 Conclusion

In this paper, we presented an NLP-based stylometric analysis of the emerging genre of fanfiction, aimed at characterizing the writing style of a successful story. We collected a new large-scale corpus which, to the best of our knowledge, is the first one of this genre for Italian. We showed that successful stories, defined as those receiving a number of reviews higher than the average, are characterized by a variety of linguistic features at different levels of granularity and that these features are more uniformly distributed within them.

In the future, we would like to broaden the perspective to other genres in order to study whether there are linguistic predictors of successful writing which are constant across different genres, as well as across concepts somewhat similar to success, such as virality and engagement.

Figure 1: Average coefficient of variation in each class of features, both for successful and unsuccessful texts.

Footnote 1: Terms of service forbid us to distribute this data. However, the tools used to gather it are available at https://github.com/AndreMatte97/Fanfiction

Footnote 2: The full story can be found at https://efpfanfic.net/viewstory.php?sid=607026&i=1

Footnote 3: The full story can be found at https://efpfanfic.net/viewstory.php?sid=27412&i=1

References

D. Brunato, A. Cimino, F. Dell'Orletta, G. Venturi, S. Montemagni. 2020. Profiling-UD: a Tool for Linguistic Profiling of Texts. In Proceedings of The 12th Language Resources and Evaluation Conference. European Language Resources Association.

D. Brunato, L. De Mattei, F. Dell'Orletta, B. Iavarone, G. Venturi. 2018. Is this Sentence Difficult? Do you Agree? In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP 2018).

A. Cimino, F. Dell'Orletta, D. Brunato, G. Venturi. 2018. Sentences and Documents in Native Language Identification. In Proceedings of 5th Italian Conference on Computational Linguistics (CLiC-IT), 1-6. Turin.

W. Daelemans. 2013. Explanation in Computational Stylometry. In A. Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing (CICLing 2013). Lecture Notes in Computer Science, 7817. Springer, Berlin, Heidelberg.

T. De Mauro. 2000. Grande dizionario italiano dell'uso (GRADIT). Torino, UTET.

V. Ganjigunte Ashok, S. Feng, Y. Choi. 2013. Success with style: Using writing style to predict the success of novels. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

S. C. Herring, J. Androutsopoulos. 2015. Computer-mediated discourse 2.0. In Deborah Tannen, Heidi E. Hamilton, Deborah Schiffrin (eds.), The handbook of discourse, 2nd ed. John Wiley & Sons.

M. Koppel, S. Argamon, A. Rachel Shimoni. 2002. Automatically Categorizing Written Texts by Author Gender. Lit. Linguistic Comput., 17(4).

S. Montemagni. 2013. Tecnologie linguistico-computazionali e monitoraggio della lingua italiana. Studi Italiani di Linguistica Teorica e Applicata (SILTA).

D. Nguyen, A. S. Dogruöz, C. P. Rosé, F. M. G. De Jong. 2016. Computational Sociolinguistics: A Survey. Computational Linguistics, 42(3).

J. Paolillo. 2001. Language variation on Internet Relay Chat: A social network approach. Journal of Sociolinguistics, 5.

C. Peersman, W. Daelemans, L. Van Vaerenbergh. 2011. Predicting Age and Gender in Online Social Networks. In Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents.

M. G. Sindoni. 2015. 'I Really Have No Idea What Non-Fandom People Do with Their Lives.' A Multimodal and Corpus-Based Analysis of Fanfiction. Lingue e Linguaggi, 13. doi:10.1285/i22390359v13p277

M. Smitha, D. Bamman. 2016. Beyond Canonical Texts: A Computational Analysis of Fanfiction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016.

T. Solorio, M. Montes-y-Gómez, J. Suraj Maharjan, Fabio A. Ovalle González. 2017. Multi-task Approach to Predict Likability of Books. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics.

H. van Halteren. 2004. Linguistic profiling for author recognition and verification. In Proceedings of the Association for Computational Linguistics.