=Paper=
{{Paper
|id=Vol-2769/52
|storemode=property
|title=The Style of a Successful Story: a Computational Study on the Fanfiction Genre
|pdfUrl=https://ceur-ws.org/Vol-2769/paper_52.pdf
|volume=Vol-2769
|authors=Andrea Mattei,Dominique Brunato,Felice Dell'Orletta
|dblpUrl=https://dblp.org/rec/conf/clic-it/MatteiBD20
}}
==The Style of a Successful Story: a Computational Study on the Fanfiction Genre==
Andrea Mattei (University of Pisa), a.mattei3@studenti.unipi.it
Dominique Brunato, Felice Dell’Orletta (Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC–CNR), ItaliaNLP Lab - www.italianlp.it), {dominique.brunato, felice.dellorletta}@ilc.cnr.it
Abstract

This paper presents a new corpus for the Italian language representative of the fanfiction genre. It comprises about 55k user-generated stories inspired by the fantasy saga “Harry Potter” and published on a popular website. The corpus is large enough to support data-driven investigations in many directions, from established studies on language variation, aimed at characterizing this genre with respect to more traditional ones, to emerging topics in computational social science, such as the identification of factors involved in the success of a story. The latter is the focus of the presented case study, in which a wide set of multi-level linguistic features was automatically extracted from a subset of the corpus and analysed in order to detect the ones that significantly discriminate successful from unsuccessful stories.

1 Introduction

Computational Sociolinguistics is an emergent interdisciplinary field aimed at exploiting computational approaches to study the relationship between language and society (Nguyen et al., 2016). One of the primary factors driving its foundation is the widespread diffusion of social media and other user-generated data available online, which has promoted massive research on computer-mediated communication from several perspectives. For instance, scholars working in the field of genre and register variation have relied on quantitative approaches to inspect the peculiarities of social media language, with the purpose of providing a characterization of this new genre with respect to more traditional ones (Paolillo, 2001; Herring and Androutsopoulos, 2015). In the NLP community, the writing style of user-generated data has been analyzed through computational stylometry approaches for addressing tasks broadly related to author profiling (Daelemans, 2013), such as gender and age detection (Peersman et al., 2011; Koppel et al., 2002). The vast majority of this work has considered content published on a few microblogging platforms regarded as more representative of the contemporary user-generated mediascape, e.g. Twitter. More recently, attention has turned to the language used by online communities whose members share a common interest in an object, an activity or, more generally, any area of human interest, allowing scholars to shed light on the growing phenomenon of fandom (Sindoni, 2015). One of the most prominent expressions of fandom is fanfiction (fanfic, fic or FF), i.e. fiction written by fans of a TV series, movie, book etc., using existing characters and situations to develop new plots. In many languages dedicated websites exist where users can publish their own literary works inspired by the original book they are fans of.

From a computational linguistics standpoint, one line of research on fanfiction has aimed to infer the relationship between user-generated stories and their original source, e.g. by comparing the representation of characters according to their gender, as well as to model reader reactions to stories (Smitha and Bamman, 2016). Inspired by that study, which was based on a large dataset of stories mainly in English, we collect a new corpus of fanfic stories¹, which, to our knowledge, is the first one for the Italian language. We rely on this corpus to carry out an investigation aimed at shedding light on the possibility of computationally modeling the expected success of a fanfic story, based on the assumptions of linguistic profiling and stylometry research.

¹ Terms of service forbid us to distribute this data. However, the tools used to gather it are available at https://github.com/AndreMatte97/Fanfiction

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Dataset collection

The corpus comprises texts collected from efpfanfic.net, a portal active since 2001 which allows users to publish stories and to comment on them. The website is made up of two sections: one for original stories and the other for fanfictions. We considered only the latter and we limited the collection to stories based on the fantasy saga by the British writer J.K. Rowling, “Harry Potter”. This choice was motivated by the main purpose of our analysis, i.e. characterizing the success of a novel with respect to its writing style rather than as an effect of the various subject matters it deals with. At the same time, choosing a very popular book allowed us to keep a consistent number of potential readers and reviewers across the corpus, while still having a large sample of texts to analyze.

The data collection was performed through web scraping, with two spiders written in Python using the open-source Scrapy framework. The first spider crawls the list of stories in the category of choice and extracts their first chapters together with some metadata, including the URLs of the subsequent chapters. The second spider takes these addresses as input and downloads texts and additional information about all the chapters after the first. In the dataset created this way, the record for each chapter includes: ID and Reference ID, the combination the website uses to identify the webpage of each chapter (we use the ID of the first chapter as a reference to group together records belonging to the same story); Title; Rating, an estimate given by the author of the rawness of the themes and scenes contained in the story; Date of posting; Author’s nickname; Number of chapters in the story; Text; Total number of reviews received by the story, divided into positive, negative and neutral; Number of reviews received by the single chapter, as well as the text of the most recent ones.

The crawlers downloaded 54,717 stories, for a total of 197,310 chapters and a mean of approximately 3.6 chapters per story, which is consistent with the mean calculated over every entry on the website. The obtained corpus was divided into folders, each containing stories with the same number of chapters.

3 The success of a fanfiction story: an exploratory study

Based on the newly created dataset, we carried out a computational stylometric analysis aimed at studying whether there is a connection between the success of a fanfic story and its writing style. Such a connection has been demonstrated for more canonical literary works covering the novel and movie domains (Ganjigunte et al., 2013; Solorio et al., 2017), showing that stylometry is a viable approach also in scenarios different from authorship attribution and verification.

The methodological framework of our investigation is linguistic profiling (Montemagni, 2013; van Halteren, 2004), an NLP-based approach in which a large set of linguistically motivated features automatically extracted from text is used to obtain a vector-based representation of it. Such representations can then be compared across texts representative of different textual genres and varieties to identify the peculiarities of each. For the purpose of our analysis, we split the original dataset into two varieties corresponding to “successful” and “unsuccessful” stories. To define success we follow an approach similar to that used by Solorio et al. (2017), which is based on the number of reviews obtained by each story. In this regard, we decided to include all reviews, not only the positive ones, which undoubtedly testify to the reader's favorable attitude towards the story. Two main reasons motivated our choice. First, we noticed that the overwhelming majority of collected reviews are written to convey appreciation, with just 0.73% of a total of nearly 900k reviews being negative; therefore, from a statistical point of view, we can reasonably disregard the distinction between various kinds of reviews and simply take into consideration the overall amount of feedback received. Second, even negative feedback shows that a given story has been read and has aroused some interest in the reader. With this in mind, we define as “unsuccessful” those stories that did not receive any reviews, thus being largely ignored by their readers. Conversely, the “successful” category includes every story whose review count is higher than the average review count of all stories with the same number of chapters. We also decided to limit the focus of this analysis to single-chapter fanfictions written before 2018, so as to avoid the inclusion of stories not yet concluded. The resulting classes comprise 2101 unsuccessful texts and 14486 successful ones, with a threshold for success amounting to 5 reviews. Table 1 shows an example of stories classified in the two categories.

All texts were pre-processed by means of regular expressions, with the aim of removing errors and inconsistencies in the use of punctuation, capitalization and special characters, in order to increase the reliability of automatic linguistic annotation and of the feature extraction process, both of which were performed using the Profiling-UD tool (Brunato et al., 2020).

In what follows we first provide an overview of the linguistic features used for our statistical analysis and then we discuss the ones that turned out to be more prominent in successful writing.

3.1 Linguistic Features

The set of features is based on the one described in Brunato et al. (2020) and counts more than 150 features, distributed across distinct levels of linguistic annotation and computed according to the Universal Dependencies (UD) annotation framework. These features have been shown to be effective in a variety of different scenarios, all related to modeling the ‘form’ of a text rather than its content: e.g., from the assessment of sentence complexity by humans (Brunato et al., 2018) to the identification of the native language of a speaker from his/her productions in a second language (L2) (Cimino et al., 2018). Specifically, they can be grouped into the following main phenomena:

Raw Text Features: Document length computed as the total number of tokens and of sentences (#Tokens, #Sentences in Table 2); average sentence length and token length, calculated in tokens and in characters, respectively (Sent length, Word length).

Lexical Richness: Distribution of words and lemmas belonging to the Basic Italian Vocabulary (De Mauro, 2000) (BIV Tok, BIV Types) and to its internal repertories (i.e. fundamental, high usage and high availability: BIV Fund; BIV High-US; BIV High-AV); Type/Token Ratio, a feature of lexical variety computed as the ratio between the number of lexical types and the number of tokens in the first 100 and 200 words of text (TTR Lemma); Lexical density.

Morpho-Syntactic Information: Distribution of all grammatical categories, with respect to the Universal part-of-speech tagset (UPOS *) and the language specific tagset (XPOS *); Distribution of verbs according to tense, mood and person, both for main and auxiliary verbs (aux *; V *).

Verbal Predicate Structure: Average distribution of verbal roots and of verbal heads for sentences (VerbHead); features related to the arity of verbs (i.e. average number of dependents for verbal head, distribution of verbs by arity).

Global and Local Parsed Tree: Average depth of the syntactic tree (MaxDepth); average depth of embedded complement chains headed by a preposition; average length of dependency links and of the maximum link (Links Len; Max Link Length); relative order of the subject and object with respect to the verb.

Syntactic relations: Distribution of typed UD dependency relations (dep *).

Use of Subordination: Distribution of main and subordinate clauses (Main clause, Subord clause), average length of subordinate chains, distribution of subordinate chains by length.

4 Data Analysis

For each considered feature we calculated the average value and the standard deviation in the two classes. We then assessed whether the variation between mean values is significant using the Wilcoxon rank sum test. We found that 57% of the features (126 out of 219) are distributed in a significantly different way between successful and unsuccessful stories. In Table 2 we report an extract of the most interesting ones; the numbers in parentheses below refer to the rows of that table.

As can be seen, successful stories are on average longer in terms of number of tokens and sentences (1, 2), although these sentences are generally shorter (3), suggesting that readers prefer a plainer writing style. However, when lexical factors are considered, the preference is given to texts exhibiting less frequent words, as suggested by the slightly lower distribution of words belonging to the Basic Italian Vocabulary (5, 6) and especially to the Fundamental one (7). Inflectional morphology also appears as a domain of variation between the two classes. Successful fanfictions use verbs in the second person more often (15), a feature typical of narrative writing related to direct speech. On the contrary, we observe a higher distribution of third person verbs, specifically auxiliaries, both singular (14) and plural (13), in less successful texts, which can hint at a preference for reported speech.
Successful, example (Italian): La città di Edimburgo era sommersa da una cascata d’acqua. Pioveva. Pioveva da giorni e giorni, senza sosta. Il cielo era illuminato di lampi e scosso da tuoni. Le strade erano vuote. Per la prima volta da giorni, allo scoccare della mezzanotte, la pioggia cessò di colpo. Il silenzio piombò sui quartieri che sembrarono improvvisamente più bui. E in quel silenzio penetrante, l’unico rumore che si riusciva a distinguere era un tac-tac-tac leggero e discontinuo. Proveniva da una finestra. La finestra di una lussuosa casa in centro, l’unica luce accesa a quell’ora. Joanne era davanti al computer, fonte di quel tremolio e scriveva. Batteva le dita sulla tastiera per alcuni istanti, poi si fermava, rileggeva, cancellava e riscriveva. Andava avanti così da giorni. I suoi occhi erano stanchi, ma la sua mente lavorava frenetica. Mancava poco².

Successful, example (English): The city of Edinburgh was flooded by a cascade of water. It was raining. It had been raining for days and days, relentlessly. The sky was lit by lightning and shaken by thunder. The streets were empty. For the first time in days, at the stroke of midnight, the rain stopped abruptly. Silence fell upon the districts that suddenly seemed darker. And in that piercing silence, the only noise that could be recognized was a faint and irregular tac-tac-tac. It was coming from a window. The window of a luxurious house in the city centre, the only light still on at that time. Joanne was in front of the computer, source of that trembling and was writing. She tapped her fingers on the keyboard for a few moments, then stopped, reread, deleted and rewrote. She had been going on like this for days. Her eyes were tired, but her mind was working frantically. Almost there.

Unsuccessful, example (Italian): Il cielo era tetro cosparso di nuvole che sembravano volere annunciare un acquazzone, il vento ulula forte facendo sbattere le finestre violentemente, come se volesse gridare, liberarsi da una rabbia repressa. La donna dai lunghi capelli rosso scuro continuava a fissare la devastazione attraverso il vetro che ora si era appannato dal suo stesso respiro. Aveva lo sguardo malinconico non più illuminato da quella dolce espressione che il riso le donava. Una mano le si poggiò sulla spalla e girò pian piano il volto verso la persona amata che con un ritmo lento cominciò ad accarezzarle le gote che assunsero un colorito roseo alla sua pelle pallida. Chiuse gli occhi come per assaporare quel dolce tocco che ora si era spostato nei suoi capelli. “Non guardare più oltre il vetro” Mormorò la voce con una nota di preoccupazione, apparteneva a James, marito di Lily la donna dai lunghi capelli rossi³.

Unsuccessful, example (English): The sky was bleak strewn with clouds that seemed to want to announce a downpour, the wind howls loudly making the windows slam violently, as if it wanted to scream, to free itself from a suppressed anger. The woman with the long dark red hair kept staring at the devastation through the glass that was now clouded by her own breath. Her melancholic gaze was no longer lit up by that sweet look that laughter gave her. A hand rested on her shoulder and slowly turned her face towards the loved one who started slowly caressing her cheeks which took on a rosy tone on her pale skin. She closed her eyes, as if to savor that sweet touch that had now moved into her hair. “Don’t look beyond the glass anymore” Whispered the voice with a note of concern, it belonged to James, husband of Lily the woman with long red hair.

Table 1: An extract of a ‘successful’ story (the most reviewed one) and of an ‘unsuccessful’ one.
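The rule used in Section 3 to assign stories to the two classes illustrated in Table 1 can be sketched as follows. This is only an illustration under stated assumptions: the field names (`id`, `n_chapters`, `n_reviews`) are hypothetical and do not come from the released crawler, the per-length average is computed over all stories of that length (including those with no reviews), and stories with at least one review but not more than the average are assumed to fall in neither class.

```python
from collections import defaultdict
from statistics import mean


def label_stories(stories):
    """Map each story id to 'successful', 'unsuccessful' or None.

    `stories` is a list of dicts with the hypothetical keys
    'id', 'n_chapters' and 'n_reviews'.
    """
    # Group review counts by story length (number of chapters).
    reviews_by_length = defaultdict(list)
    for story in stories:
        reviews_by_length[story["n_chapters"]].append(story["n_reviews"])
    average_by_length = {
        length: mean(counts) for length, counts in reviews_by_length.items()
    }

    labels = {}
    for story in stories:
        if story["n_reviews"] == 0:
            # No reviews at all: the story was largely ignored.
            labels[story["id"]] = "unsuccessful"
        elif story["n_reviews"] > average_by_length[story["n_chapters"]]:
            # Strictly above the average for stories of the same length.
            labels[story["id"]] = "successful"
        else:
            # Assumption: reviewed, but not above average, so left out.
            labels[story["id"]] = None
    return labels


if __name__ == "__main__":
    toy_corpus = [  # made-up records, not corpus data
        {"id": "a", "n_chapters": 1, "n_reviews": 0},
        {"id": "b", "n_chapters": 1, "n_reviews": 2},
        {"id": "c", "n_chapters": 1, "n_reviews": 10},
    ]
    print(label_stories(toy_corpus))
```

Comparing each story only with stories of the same length keeps multi-chapter stories, which have more opportunities to collect reviews, from dominating the successful class.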
² The full story can be found at https://efpfanfic.net/viewstory.php?sid=607026&i=1
³ The full story can be found at https://efpfanfic.net/viewstory.php?sid=27412&i=1

Focusing on the distribution of morpho-syntactic categories, there is a significant difference in the usage of the most common punctuation marks, commas (25) and full stops (26), which are more frequent in highly-reviewed fanfictions. These features relate to the previously observed difference in document length, as texts with more sentences necessarily use more punctuation marks to divide them. Additionally, we can see that balanced marks (24), i.e. parentheses and quotation marks, occur more often in successful texts, strengthening our previous claim about a more frequent presence of direct speech in this class. At the syntactic level, dependency relations are slightly shorter in successful texts, considering both the average value of all dependencies (29) and the value of the maximum dependency link (30). In readability assessment studies, longer syntactic dependencies are typically found in complex texts, and the same holds for deeper syntactic trees. Both these features have lower values in highly-reviewed stories, suggesting that the style of successful writing is characterized by a simpler syntactic structure. Interestingly, these results, although preliminary, go in the opposite direction to those reported by Ganjigunte et al. (2013) for successful literary works in English, which were found to be less correlated with text readability scores. Finally, subordinate clauses (33) occur slightly more often than main clauses (32) in unsuccessful texts, while there is a nearly even split between hypotaxis and parataxis in successful ones.

Feature | Unsucc Avg (StDev) | Success Avg (StDev)
Raw Text Features
1. # Tokens | 1401 (1940) | 2120 (2718)
2. # Sentences | 78.4 (116.7) | 125.1 (153.6)
3. Sent length | 20.18 (12.39) | 17.38 (6.43)
4. Word length | 4.50 (.250) | 4.52 (.193)
Lexical Features
5. % BIV Tok | 85.7 (5.1) | 84.8 (4.7)
6. % BIV Types | 73.4 (7) | 70.1 (7)
7. % BIV Fund | 61 (7.5) | 57.1 (7.7)
8. * % BIV High-AV | 3.1 (1) | 3.1 (1)
9. % BIV High-US | 8.5 (2.4) | 9.1 (2.5)
10. Lexical density | .498 (.033) | .503 (.031)
11. TTR Lemma 100 | .560 (.118) | .560 (.112)
12. TTR Lemma 200 | .433 (.114) | .436 (.110)
Morpho-Syntactic Features
13. % Aux 3perPl | 13.2 (9) | 11.8 (7.6)
14. % Aux 3perSin | 54.4 (17.1) | 53.2 (15.6)
15. % Aux 2perSin | 6.3 (8.1) | 7.9 (8.5)
16. % Aux Imperf. | 38.5 (24.8) | 31.3 (23.6)
17. % Aux Pres. | 52.4 (26) | 60 (23.6)
18. % V Gerund | 5.7 (3.8) | 6.3 (4)
19. % upos VERB | 12.5 (1.8) | 12.3 (1.7)
20. % upos NOUN | 13.8 (2.3) | 13.5 (2.1)
21. % upos ADJ | 4.7 (1.4) | 4.6 (1.2)
22. % upos PRON | 8.59 (2.24) | 8.51 (2)
23. % upos ADP | 10.8 (1.9) | 10.4 (1.8)
24. % xpos FB | 1.7 (2) | 2.3 (2.4)
25. % xpos FF | 6.5 (2.7) | 7.1 (2.8)
26. % xpos FS | 5.5 (2.2) | 6.1 (2.1)
27. % xpos CC | 3.1 (.9) | 2.9 (.8)
28. % xpos CS | 1.7 (.7) | 1.8 (.7)
Syntactic Features
29. Links Len | 2.78 (.438) | 2.72 (.385)
30. * Max Link Len | 1.19 (2.38) | .687 (1.33)
31. MaxDepth | 3.96 (1.45) | 3.58 (.857)
32. % Main clause | 48.8 (9.9) | 49.9 (9)
33. % Subord clause | 51.2 (9.9) | 50.1 (9)
34. % Verb Head | 2.63 (1.72) | 2.26 (.897)
35. % dep nsubj | 4.9 (1.1) | 4.7 (1)
36. * % dep obj | 5.3 (1.1) | 5.3 (1)
37. % dep obl | 5.5 (1.1) | 5.2 (1)
38. % dep punct | 14.2 (3.8) | 16 (4.1)
39. % dep conj | 4 (1.3) | 3.7 (1.1)
40. % dep det | 10.9 (2) | 10.5 (1.8)

Table 2: An extract of linguistic features varying significantly between successful and unsuccessful stories. All differences are significant at p < 0.001, except for features marked with an asterisk, which have p < 0.05.

To deepen our analysis, we also computed the coefficient of variation σ* for all features varying significantly between the two classes, where σ* is the ratio between the standard deviation σ and the mean µ (σ* = σ/µ). This allowed us to evaluate the dispersion of values around the average in a standardized way, and thus to compare the stability of features measured on different scales. A feature that is widely scattered in one class of texts and highly stable in the other has a greater chance of being a meaningful representative of the latter.

In Figure 1 we show the average variability in the two classes of the four groups of features distinguished according to the level of annotation they were extracted from. As a whole, we noticed that successful texts display less variability in nearly every considered feature: 117 of them (92%) are more stable in this class. In successful stories, the features with greater stability compared to the other class are mainly raw text features, e.g. number of sentences and number of tokens, and syntactic ones, e.g. verbal heads per sentence and average depth of syntactic trees. Among the few features which are more stable in poorly received texts, we find instead verbal predicate features, such as the distributions of past tenses and of indicative moods, in addition to the frequency of usage of cardinal numbers. The set of lexical features is instead the most stable one for both classes.

Figure 1: Average coefficient of variation in each class of features, both for successful and unsuccessful texts.
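The coefficient of variation used in the stability analysis (standard deviation divided by mean) can be recomputed directly from the averages and standard deviations reported in Table 2. The sketch below is illustrative only: the function names are hypothetical, the three rows are copied from Table 2, and the sample standard deviation is used when starting from raw values, which is an assumption since the paper does not state which estimator was applied.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Coefficient of variation of raw values (sample standard deviation / mean)."""
    average = mean(values)
    if average == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return stdev(values) / average


def cv_from_summary(average, standard_deviation):
    """Coefficient of variation from an already computed mean and standard deviation."""
    if average == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return standard_deviation / average


def more_stable_class(unsuccessful, successful):
    """Return the class with the lower coefficient of variation.

    Each argument is an (average, standard_deviation) pair.
    """
    cv_unsuccessful = cv_from_summary(*unsuccessful)
    cv_successful = cv_from_summary(*successful)
    return "successful" if cv_successful < cv_unsuccessful else "unsuccessful"


# (unsuccessful, successful) pairs of (Avg, StDev), copied from Table 2.
TABLE_2_EXTRACT = {
    "# Tokens": ((1401, 1940), (2120, 2718)),
    "Sent length": ((20.18, 12.39), (17.38, 6.43)),
    "MaxDepth": ((3.96, 1.45), (3.58, 0.857)),
}

if __name__ == "__main__":
    for feature, (unsuccessful, successful) in TABLE_2_EXTRACT.items():
        print(
            feature,
            round(cv_from_summary(*unsuccessful), 3),
            round(cv_from_summary(*successful), 3),
            more_stable_class(unsuccessful, successful),
        )
```

For these three rows the successful class has the lower coefficient of variation, in line with the observation that raw text and syntactic features are more stable in successful stories.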
5 Conclusion

In this paper, we presented an NLP-based stylometric analysis of the emerging genre of fanfiction aimed at characterizing the writing style of a successful story. We collected a new large-scale corpus which, to the best of our knowledge, is the first one of this genre for Italian. We showed that successful stories, defined as those receiving a number of reviews higher than the average, are characterized by a variety of linguistic features at different levels of granularity and that these features are more uniformly distributed within them. In the future, we would like to broaden the perspective to other genres in order to study whether there are linguistic predictors of successful writing which are constant across different genres, as well as across concepts somehow similar to success, such as virality and engagement.

References

D. Brunato, A. Cimino, F. Dell’Orletta, G. Venturi and S. Montemagni. 2020. Profiling-UD: a Tool for Linguistic Profiling of Texts. Proceedings of The 12th Language Resources and Evaluation Conference, European Language Resources Association, 7145–7151.

D. Brunato, L. De Mattei, F. Dell’Orletta, B. Iavarone and G. Venturi. 2018. Is this Sentence Difficult? Do you Agree? Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), 2018.

A. Cimino, F. Dell’Orletta, D. Brunato and G. Venturi. 2018. Sentences and Documents in Native Language Identification. Proceedings of 5th Italian Conference on Computational Linguistics (CLiC-IT), 1–6, Turin.

W. Daelemans. 2013. Explanation in Computational Stylometry. Gelbukh A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013, Lecture Notes in Computer Science, vol 7817. Springer, Berlin, Heidelberg.

Tullio De Mauro. 2000. Grande dizionario italiano dell’uso (GRADIT). Torino, UTET.

V. Ganjigunte Ashok, S. Feng and Yejin Choi. 2013. Success with style: Using writing style to predict the success of novels. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1753–1764.

S.C. Herring and J. Androutsopoulos. 2015. Computer-mediated discourse 2.0. The handbook of discourse, 2nd ed. Deborah Tannen, Heidi E. Hamilton, Deborah Schiffrin, eds. John Wiley & Sons, 1753–1764.

M. Koppel, S. Argamon and A. Rachel Shimoni. 2002. Automatically Categorizing Written Texts by Author Gender. Lit. Linguistic Comput., 17, 4, 401–412.

S. Montemagni. 2013. Tecnologie linguistico-computazionali e monitoraggio della lingua italiana. Studi Italiani di Linguistica Teorica e Applicata (SILTA), 145–172.

D. Nguyen, A.S. Doğruöz, C.P. Rosé, and F.M.G. de Jong. 2016. Computational Sociolinguistics: A Survey. Computational Linguistics, Vol. 42, No. 3, 537–593.

John Paolillo. 2001. Language variation on Internet Relay Chat: A social network approach. Journal of Sociolinguistics, 5, 180–213.

C. Peersman, W. Daelemans, and L. Van Vaerenbergh. 2011. Predicting Age and Gender in Online Social Networks. Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, 37–44.

M.G. Sindoni. 2015. ‘I Really Have No Idea What Non-Fandom People Do with Their Lives.’ A Multimodal and Corpus-Based Analysis of Fanfiction. Lingue e Linguaggi, (13), 2015, 277–300, doi.org/10.1285/i22390359v13p277.

M. Smitha and D. Bamman. 2016. Beyond Canonical Texts: A Computational Analysis of Fanfiction. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016.

T. Solorio, M. Montes-y-Gómez, Suraj Maharjan, J. Ovalle and Fabio A. González. 2017. Multi-task Approach to Predict Likability of Books. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 1217–1227.

H. van Halteren. 2004. Linguistic profiling for author recognition and verification. Proceedings of the Association for Computational Linguistics, 200–207.