<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Hits or Misses? A Linguistically Explainable Formula for Fanfiction Success</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Giulio</forename><surname>Leonardi</surname></persName>
							<email>g.leonardi5@studenti.unipi.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Pisa</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Dominique</forename><surname>Brunato</surname></persName>
							<email>dominique.brunato@ilc.cnr.it</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Istituto di Linguistica Computazionale &quot;Antonio Zampolli&quot;</orgName>
								<orgName type="department" key="dep2">ItaliaNLP Lab</orgName>
								<address>
									<settlement>Pisa</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Felice</forename><surname>Dell'Orletta</surname></persName>
							<email>felice.dellorletta@ilc.cnr.it</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Istituto di Linguistica Computazionale &quot;Antonio Zampolli&quot;</orgName>
								<orgName type="department" key="dep2">ItaliaNLP Lab</orgName>
								<address>
									<settlement>Pisa</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Hits or Misses? A Linguistically Explainable Formula for Fanfiction Success</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">1E2A665CED8FDFD14C927EB9CF6416D8</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>fanfiction</term>
					<term>Italian corpus</term>
					<term>success prediction</term>
					<term>linguistic features</term>
					<term>Explainable Boosting Machine</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This study presents a computational analysis of Italian fanfiction, aiming to construct an interpretable model of successful writing within this emerging literary domain. Leveraging explicit features that capture both linguistic style and semantic content, we demonstrate the feasibility of automatically predicting successful writing in fanfiction and identify a set of robust linguistic predictors that maintain their predictive power across diverse topics and time periods, offering insights into the universal aspects of engaging storytelling. This approach not only enhances our understanding of fanfiction as a genre but also offers potential applications in broader literary analysis and content creation.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction and Motivation</head><p>The growing proliferation of online literary content has led to the emergence of new genres and storytelling forms, with fanfiction being particularly popular among teens and young adults. Fanfiction consists of stories created by fans (mostly hobby authors) that extend or alter the narrative of existing popular media such as books, movies, comics or games, and represents a significant portion of user-generated content on the web <ref type="bibr" target="#b0">[1]</ref>. In recent years, the genre's widespread popularity has prompted research into the linguistic and stylistic elements that contribute to its success, mirroring studies conducted on more traditional literary genres <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>, among others.</p><p>Understanding the elements that contribute to narrative success is a fascinating area of research with implications across various fields, from literary analysis to digital humanities. From a socio-linguistic perspective, it can offer deeper insights into people and culture. It also has significant applications in areas such as personalized content recommendation and educational technology <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>. While personal interests undoubtedly play a crucial role in predicting a reader's engagement with literary content, the way information is presented can also evoke different reactions and levels of interaction, ultimately influencing the narrative's success. 
In this regard, recent advancements in Natural Language Processing (NLP) and machine learning offer a powerful lens for making explicit the patterns that may explain the complex interplay between reader engagement and content success.</p><p>This paper contributes to this field, presenting a computational analysis focused on Italian fanfiction and addressing the following research questions: i.) Can the success of Italian fanfiction be automatically predicted using stylistic and lexical features of the texts?; ii.) Which types of features demonstrate the highest predictive capability, and how consistent are these features across different time periods and thematic domains?; iii.) To what extent can these features be explained in terms of their contribution to predicting success?</p><p>Our contributions. i.) We collected a corpus of Italian fanfiction stories enriched with metadata considered as proxies of their success; ii.) We investigated the relationship between stylistic and lexical features of stories and their success from a modeling perspective; iii.) We identified the most influential features for success prediction, showing the key role played by form- and style-related features across the time periods and thematic domains of fanfictions.</p><p>The paper is structured as follows: Section 2 briefly contextualizes our study within the relevant literature; Section 3 presents the reference corpus of Italian fanfiction stories that we collected; Section 4 provides an overview of the approach we devised, including the description of the features used for classification and the classifiers employed; Section 5 discusses the main findings and offers a fine-grained analysis of the classification results in terms of feature explainability; Section 6 summarizes key findings and outlines promising directions for future research in this field.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>The exploration of online content and its engagement levels has increasingly benefited from advancements in NLP and machine learning. Studies differ in the textual domains considered, the typology of linguistic features employed, and the quantitative metrics used to operationalize a highly subjective concept such as success. The study by Toubia and colleagues <ref type="bibr" target="#b6">[7]</ref> explores how the structure of narratives, particularly the internal semantic progression measured by features derived from dense word representations, affects the success of stories across different text typologies (movies, TV shows, and academic papers). Berger and colleagues <ref type="bibr" target="#b7">[8]</ref> examine how the linguistic structure of online content affects user engagement, specifically by modeling sustained attention. This concept goes beyond attracting a reader with a catchy headline or advertisement; it also encompasses the likelihood that a reader will continue viewing or reading the content. In their analysis of more than 35,000 pieces of online content from heterogeneous sources, they emphasize the role of features related to processing ease and emotional language.</p><p>In the realm of literary works, Ashok et al. <ref type="bibr" target="#b1">[2]</ref> first leveraged stylometric analysis and machine learning techniques to predict the success of popular English novels from the Gutenberg Project. Their approach demonstrated the potential of these techniques for assessing literary success. Extending these findings, Maharjan et al. <ref type="bibr" target="#b8">[9]</ref> proposed a multi-task approach that jointly addresses success and genre prediction. 
Using deep learning representations in addition to hand-crafted features related to topic, sentiment, writing style, and readability of books, they obtained better performance than a single-task approach to success prediction. Focusing on contemporary English-language literature, the study by Bizzoni and colleagues <ref type="bibr" target="#b9">[10]</ref> investigates how perceived novel quality is influenced by a broad spectrum of textual features - such as those related to readability and sentiment - and how these perceptions vary depending on the reader's level of expertise.</p><p>The growing volume of online fanfiction has also been the subject of numerous studies, either from a text-mining perspective using NLP or through a qualitative lens via manual examination. A comprehensive survey of analyses in this direction has recently been provided by <ref type="bibr" target="#b11">[11]</ref>. For example, Milli and Bamman <ref type="bibr" target="#b12">[12]</ref> explore the relationship between fanfiction and its original canon, offering one of the first empirical analyses of this genre. Similarly, Sourati et al. <ref type="bibr" target="#b13">[13]</ref> find that the similarity between fanfictions and their original stories - particularly in terms of emotional arcs and character dynamics - correlates significantly with a fanfiction's popularity.</p><p>In the context of Italian fanfiction, research using NLP techniques is still limited. Mattei et al. <ref type="bibr" target="#b14">[14]</ref> employ linguistic profiling to analyze a corpus of Italian fanfiction inspired by the Harry Potter series, with the purpose of identifying linguistic patterns associated with success.</p><p>Inspired by this previous study, our research aims to extend these findings through a computational modeling approach, investigating the power of linguistic features for predicting fanfiction success and their generalization across different experimental settings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Corpus Construction</head><p>As a first step, we compiled a reference corpus of Italian fanfiction. To this end, we searched the texts available on efpfanfic.net, one of the largest Italian websites dedicated to publishing and reading amateur stories, focusing specifically on stories labeled as fanfiction. Using a web scraping system, we extracted fanfictions based on the Harry Potter series, a highly popular fandom on the site, boasting 57,196 stories published between 2003 and 2023. Figure <ref type="figure" target="#fig_0">1</ref> presents the temporal distribution of these fanfictions up to 2020.</p><p>Additionally, we gathered a secondary corpus consisting of 2,441 stories based on The Lord of the Rings series. This secondary corpus served as a test set to assess the influence of thematic domains on the analysis of story success.</p><p>For this study, we focused on the first chapter of each fanfiction to ensure a consistent analysis. While it is widely recognized that thematic units within stories - particularly the beginnings and endings - often differ from the middle sections due to their distinct narrative roles, we observed that the majority of stories (69%) consist of only a single chapter, making them effectively self-contained. The efpfanfic portal allows users to review each chapter with ratings marked as negative, neutral, or positive. Consistent with prior research such as <ref type="bibr" target="#b8">[9]</ref>, we used the absolute number of reviews to define the success of a story, which we consider broadly as popularity. This approach is based on the assumption that a high number of interactions, regardless of their sentiment, reflects strong reader engagement. 
This assumption is further supported by our data, in which negative reviews represent less than 1% of the total.</p><p>To formulate our success prediction task, we established a review threshold to classify each story as either a success or a failure. After analyzing the distribution of reviews for Harry Potter texts (Figure <ref type="figure" target="#fig_1">2</ref>), we decided to exclude stories that fell in the middle of the distribution - those that could not be clearly defined as successes or failures. Consequently, stories with fewer than two reviews (25th percentile) were classified as failures, and those with more than six reviews (75th percentile) as successes. Stories within the interquartile range were excluded from the analysis. We also excluded texts published after 2020, considering them too recent for meaningful comparison.</p><p>As summarized in Table <ref type="table" target="#tab_0">1</ref>, the final corpora, hereafter abbreviated as HP (Harry Potter) and LOTR (The Lord of the Rings), consist of 26,032 and 932 texts, respectively.</p></div>
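The percentile-based labeling rule described above can be sketched in a few lines. The thresholds (fewer than 2 reviews = failure, more than 6 = success, interquartile range excluded) are those reported in this section; the function name is ours.

```python
# Sketch of the review-count labeling described above. Thresholds follow the
# 25th/75th percentiles reported in Section 3; the function name is illustrative.
def label_story(n_reviews: int):
    """Return 'failure', 'success', or None for stories in the excluded IQR."""
    if n_reviews < 2:       # below the 25th percentile
        return "failure"
    if n_reviews > 6:       # above the 75th percentile
        return "success"
    return None             # interquartile range: dropped from the corpus

labels = [label_story(n) for n in (0, 1, 2, 6, 7, 15)]
```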
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Methodology</head><p>Based on the newly collected dataset and its binary labeling, we formulated success prediction as a binary classification problem: given a story, the model is asked to predict whether it belongs to the successful or unsuccessful class, where the two classes are defined according to the metric based on the number of reviews received from readers.</p><p>In line with our main purpose of constructing a model of success grounded in interpretable factors, we decided to leverage explicit features modelling both style-related and lexical aspects of the text as input to the classification system. To evaluate the effectiveness and robustness of these features, we conducted experiments across three conceptually distinct scenarios, assessing the ability to discriminate success in different contexts. The main components of our approach are detailed in the following sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Success Predictors</head><p>A comprehensive set of features was extracted for each story in the corpus. These features fall into two primary groups: linguistic features, reflecting the text's linguistic style and structure, and lexical features, representing the semantic content of the text.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.">Linguistic Features</head><p>To model a text's linguistic style and structure, we drew inspiration from the linguistic profiling framework, an NLP-based methodology in which a large set of linguistically motivated features, automatically extracted from annotated texts, is used to obtain a vector-based representation of each text. Such representations can then be compared across texts representative of different textual genres and varieties to identify the peculiarities of each <ref type="bibr" target="#b15">[15]</ref>. For our study, we relied on Profiling-UD<ref type="foot" target="#foot_0">1</ref>, a multilingual tool inspired by this framework, which extracts over 130 linguistic features from texts using the Universal Dependencies (UD) annotation formalism. As described in Brunato et al. <ref type="bibr" target="#b16">[16]</ref>, these features encompass a range of linguistic phenomena that can be classified into distinct groups, covering e.g. shallow text features (e.g. document and sentence length, average word length), the distribution of grammatical categories, inflectional morphology, and syntactic properties related to local and global parse tree structure.</p><p>These features have proven effective in tasks related to modeling text form, such as assessing text complexity and identifying stylistic traits of authors or author groups. Building on previous research on a similar corpus of fanfiction <ref type="bibr" target="#b14">[14]</ref>, we hypothesize that these features can also distinguish between successful and unsuccessful fanfictions from a modeling perspective.</p></div>
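As a rough illustration of the shallow end of this feature set, a few of the surface properties mentioned above can be computed directly from raw text. This is only a toy sketch: the actual Profiling-UD pipeline derives these and over 130 other features from full UD morpho-syntactic annotation, not from regular expressions.

```python
# Toy sketch of a few shallow profiling features (names follow Table 3); the
# real Profiling-UD pipeline derives these and 130+ others from UD-annotated text.
import re

def shallow_profile(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text)
    return {
        "n_sentences": len(sentences),
        "n_tokens": len(tokens),
        "char_per_tok": sum(len(t) for t in tokens) / len(tokens),
        "tokens_per_sent": len(tokens) / len(sentences),
    }

profile = shallow_profile("Harry aprì la porta. Tutto era silenzioso.")
```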
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2.">Lexical Features</head><p>The second representation is based on lexical information and leverages the relative frequency of n-grams in each document. The choice of n-grams, in contrast to more powerful semantic representations derived from embeddings, is deliberately motivated by the desire to use lexical features that remain completely explicit. The model, henceforth referred to as the Lexical Model, consists of the following features:</p><p>• Forms: unigrams, bigrams, and trigrams of tokens.</p><p>• Lemmas: unigrams, bigrams, and trigrams of lemmas.</p><p>• Characters: sequences of characters at the beginning or end of words, ranging from 1 to 4 characters in length.</p></div>
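A minimal sketch of how such explicit lexical features could be extracted follows. Lemma n-grams are omitted, since they would require a lemmatizer; the function and feature-key names are ours.

```python
# Minimal sketch of the Lexical Model's explicit features: relative frequencies
# of token n-grams (n = 1..3) and word-initial/final character sequences (1..4).
# Lemma n-grams are omitted here, as they would require a lemmatizer.
from collections import Counter

def lexical_features(tokens, max_n=3, max_affix=4):
    counts = Counter()
    for n in range(1, max_n + 1):                        # form n-grams
        for i in range(len(tokens) - n + 1):
            counts[("ngram", " ".join(tokens[i:i + n]))] += 1
    for tok in tokens:                                   # character affixes
        for k in range(1, min(max_affix, len(tok)) + 1):
            counts[("prefix", tok[:k])] += 1
            counts[("suffix", tok[-k:])] += 1
    total = sum(counts.values())
    return {feat: c / total for feat, c in counts.items()}

feats = lexical_features(["la", "porta", "si", "aprì"])
```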
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Classifiers</head><p>In line with our research questions, the explainability of the classification is crucial for evaluating the impact of linguistic and lexical features on the prediction of success. Therefore, we selected two classification algorithms that allow for a precise global explanation of the predictions.</p><p>The first classifier employed is a linear Support Vector Machine (SVM). By fitting a decision hyperplane in the feature space, this method enables the examination of the hyperplane's coefficients to assess the importance of the features.</p><p>The second algorithm employed is the Explainable Boosting Machine (EBM), which belongs to the family of Generalized Additive Models (GAMs). As explained in <ref type="bibr" target="#b17">[17]</ref>, a GAM is a model of the form:</p><formula xml:id="formula_0">g(y) = β_0 + Σ_n f_n(x_n)<label>(1)</label></formula><p>where g(·) is called the link function, used to model the output (e.g., the logistic function for classification). Each f_n(·) is referred to as a shape function: a univariate function modeling the relationship between feature n and the target.</p><p>The prediction is thus a sum of non-linear and arbitrarily complex shape functions, one per feature, generally resulting in better accuracy compared to linear models. Additionally, with a reasonable number of features, the model remains explainable. Each shape function can be visualized as a two-dimensional plot, with the feature value on the x-axis and the score assigned by the shape function on the y-axis. A score greater than 0 indicates a contribution towards the positive class, whereas a score less than 0 indicates a contribution towards the negative class. The final prediction for a record is simply the sum of the scores obtained from each shape function, potentially transformed by the link function. 
Beyond analyzing individual shape functions, the average contribution of each feature can be evaluated by taking the mean of the absolute values of its assigned scores.</p><p>There are various algorithms within the family of GAMs, primarily distinguished by the method used to fit the shape functions. The EBM uses standard gradient boosting; however, in each boosting iteration, the algorithm sequentially cycles through the features, constructing each univariate shape function through bagged boosted trees. This method has proven to be one of the most effective for training a GAM.</p><p>For our study, the EBM was employed exclusively in experiments based on linguistic features, due to the excessive dimensionality of the lexical model: this high dimensionality would have rendered the GAM too complex to interpret and too time-consuming to train.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Discussion</head><p>The classification results are summarized in Table <ref type="table" target="#tab_1">2</ref> for each model and scenario under evaluation.</p><p>For models using linguistic features, in the in-domain scenario both the SVM and the EBM outperform the majority class baseline, with accuracies of 65.03% and 66.15% respectively, compared to 50.16% for the baseline. This indicates that both classifiers effectively capture the linguistic patterns associated with success within the same thematic domain.</p><p>In the out-domain scenario, the performance of the linguistic SVM drops significantly, with an accuracy of 59.22%, whereas the EBM experiences a less pronounced decline, reaching 64.70% and remaining well above the 56.43% baseline.</p><p>The lexical model, in the in-domain scenario, achieves an accuracy of 69.95%, outperforming all models with linguistic features, suggesting that lexical features provide a more powerful representation for in-domain success prediction. Nevertheless, in the out-domain scenario, the lexical model does not surpass the baseline, indicating a complete lack of predictive ability. This suggests that lexical features, which are primarily based on the content of the specific fanfiction's narrative universe, perform well within the same thematic domain but lose all significance outside of it. Conversely, linguistic features, which focus on the form of the text, appear to be more adaptable regardless of the theme.</p><p>Figure <ref type="figure" target="#fig_3">3</ref> presents the performance over time for classifiers trained with linguistic features. Additionally, two baselines are shown: "Random Choice", which randomly selects between the two classes, and "Maj. Class", which always assigns the majority class from the corresponding training set (2011 stories), i.e. the positive one. The results of the lexical model in the cross-time scenario were uninformative, as they were very similar to the "Maj. Class" baseline. 
The classifier, therefore, defaults to assigning a single class, demonstrating no predictive capability; to avoid confusion, the lexical model results are not included in the figure. In contrast, the cross-time results for models using linguistic features are more meaningful: the results remain stable around an average of 62%, regardless of the dominant class in the tested year and the classifier used (avg. cross-time in Table <ref type="table" target="#tab_1">2</ref>).</p><p>The cross-time scenario further suggests that linguistic features possess greater adaptability beyond their own domain, maintaining a considerable degree of generalization over time. Conversely, lexical features seem functional only within the specific domain of the training set, losing all predictive power on texts from different domains. Overall, the model that performed best on average across the three scenarios, and with the least variance in performance, is the EBM trained with linguistic features. We provide an in-depth analysis of this model in the following section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">The Model of Success</head><p>To gain a better understanding of the classification results and identify the most influential features for predicting success, we ranked the features according to the absolute value of their weight in the EBM classifier trained on the entire training set. Table <ref type="table" target="#tab_2">3</ref> presents the top 15 features. The analysis reveals that, in addition to basic text features such as the average document length measured in tokens (ranked first) and the average word length in characters (ranked second), more complex linguistic properties play a crucial role. Among these, features related to verbal predicates and verbal morphology emerge as particularly influential. This suggests that the syntactic and morphological characteristics of verbs, such as tense, mood and person, provide valuable information for the classifier's prediction, highlighting the importance of deeper linguistic structures in building a model of successful writing.</p><p>While this ranking highlights the 'global' importance of features, it does not explain their effect on classification. For a more detailed analysis, Figure <ref type="figure" target="#fig_4">4</ref> in Appendix A highlights the threshold values for each of the top 15 ranked features, indicating the point at which the expected classification shifts from one class to the other. Additionally, it provides the number of instances in the training set for each feature value. Interestingly, some features split the data almost exactly into two subsets. For example, the feature representing word length (char_per_tok) has a discriminant threshold of 4.55 characters, which distinguishes successful stories - typically with longer words - from unsuccessful ones - usually with shorter words. 
Similarly, features related to the (morpho-)syntactic profile of the text, such as the percentage of conjunctions (dep_dist_conj) and of finite verb forms (verbs_form_dist_Fin), show a similar pattern: values lower than the discriminant threshold contribute to predicting the negative class, effectively splitting the data into two groups with comparable densities. Regarding verb presence (verbal_head_per_sent), an increased use of verbs correlates with the unsuccessful class. This finding contradicts the idea that higher readability, typically conveyed by a predominantly verbal prose rather than a nominal one, is a good indicator of writing quality. However, it aligns with observations by Ashok et al. <ref type="bibr" target="#b1">[2]</ref>, who identified similar patterns in canonical literary novels.</p><p>Features related to verbal morphology also show a peculiar trend. For instance, a complementary perspective emerges concerning the use of person morphology. Increasing the use of the second person plural beyond a relatively low threshold (0.4) positively affects the prediction of success, which may indicate an alignment with the Reader-Insert<ref type="foot" target="#foot_1">2</ref> format, a specific type of fanfiction where the reader assumes the role of the protagonist, heavily relying on second-person narration. In contrast, an excessive use of the first person plural is associated with the negative class.</p></div>
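The 'global' importance ranking behind Table 3 (the mean absolute shape-function score over the training set) can be mimicked as follows; the linear shape function here is a hypothetical stand-in for a fitted one.

```python
# Sketch of the global feature importance used for Table 3: for each feature,
# average the absolute shape-function scores over the training records. The
# linear shape function below is a hypothetical stand-in for a fitted one.
def global_importance(records, shape_functions):
    return {
        name: sum(abs(f(r[name])) for r in records) / len(records)
        for name, f in shape_functions.items()
    }

records = [{"char_per_tok": 4.8}, {"char_per_tok": 4.2}, {"char_per_tok": 5.0}]
shape_functions = {"char_per_tok": lambda v: v - 4.55}  # hypothetical shape function
importance = global_importance(records, shape_functions)
```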
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>Understanding success factors in literary writing is an evolving area of cross-disciplinary research. This study on Italian fanfiction demonstrated the feasibility of predicting success using computational methods and explainability techniques. Notably, we found that features related to the style and structure of texts show greater robustness than lexical ones across different domains and time periods. This suggests that the way a story is crafted may be more universally appealing than specific word choices or thematic elements. We believe that the implications of this study extend far beyond fanfiction research. On the one hand, it provides new methodologies for analyzing online literary phenomena, offering potential contributions to digital humanities. From the NLP perspective, it could inform text generation models, potentially guiding the creation of content that resonates more effectively with readers.</p><p>Future research could explore the generalizability of these findings to other languages and genres, as well as investigate the dynamics of evolving reader preferences over time, also considering alternative measures to gauge success. Additionally, this study does not take into account the importance of the author; a potential future development would be to incorporate author-related information into the model.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Distribution of all fanfictions from the Harry Potter corpus by year of publication (up to 2020).</figDesc><graphic coords="3,89.29,172.47,203.37,121.87" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Distribution of published fanfiction from the Harry Potter corpus by number of reviews in the first chapter.</figDesc><graphic coords="3,89.29,335.40,203.37,116.47" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>Specifically, the first scenario is in-domain: the classifier is evaluated on texts within the same thematic domain as the training set, using 10-fold cross-validation on the HP corpus. The second scenario is out-domain: the classifier is evaluated on texts from a different thematic domain than the training set. In this case, the HP corpus is used as the training set, while the LOTR corpus serves as the test set. Finally, in the cross-time scenario, the temporal impact on classification is considered. The classifier is trained solely on texts from the HP corpus published in 2011 and sequentially tested on texts from each other year from 2003 to 2020. The 2011 texts were chosen for training because this year has the largest amount of data (3,755 texts), is approximately central within the temporal range [2003, 2020], and is particularly significant for fanfiction production due to the release of the final film in the Harry Potter saga.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Classification Accuracy in the Cross-Time Setting</figDesc><graphic coords="5,89.29,84.19,203.36,94.83" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Visualization of the Shape Functions of the Top 15 Linguistic Features of the EBM. In each graph pair, the x-axis represents the feature value, the y-axis of the line plot indicates the score assigned by the shape function, and the marked threshold value denotes the feature value at the zero score point. For the features represented by absolute numbers (i.e.n_tokens, char_per_tok, n_sentences, and n_prepositional_chains), the values are displayed as raw counts. For the remaining features, which are expressed as percentage distributions, the values are shown accordingly. More details about how these features are calculated are reported in<ref type="bibr" target="#b16">[16]</ref>.</figDesc><graphic coords="8,89.29,84.19,416.70,573.18" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Descriptive Statistics for the Harry Potter (HP) and Lord of The Rings (LOTR) Corpora</figDesc><table><row><cell>Corpus</cell><cell>#texts</cell><cell cols="3">#negatives #positives avg. #tok</cell></row><row><cell>HP</cell><cell>26,032</cell><cell>13,058</cell><cell>12,974</cell><cell>1911</cell></row><row><cell>LOTR</cell><cell>932</cell><cell>526</cell><cell>406</cell><cell>1946</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Classification Accuracy (%) of the Models. 'Ling.' and 'Lex.' refer respectively to models trained on linguistic and lexical features. The baseline corresponds to the majority class label.</figDesc><table><row><cell>Scenario</cell><cell>SVM Ling.</cell><cell>EBM Ling.</cell><cell>SVM Lex.</cell><cell>Baseline</cell></row><row><cell>in-domain</cell><cell>65.03</cell><cell>66.15</cell><cell>69.95</cell><cell>50.16</cell></row><row><cell>out-domain</cell><cell>59.22</cell><cell>64.70</cell><cell>43.45</cell><cell>56.43</cell></row><row><cell>avg. cross-time</cell><cell>62.02</cell><cell>62.81</cell><cell>49.31</cell><cell>49.20</cell></row><row><cell>average</cell><cell>62.09</cell><cell>64.55</cell><cell>54.24</cell><cell>51.93</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Top 15 Feature Scores of the EBM Trained with Linguistic Features</figDesc><table><row><cell>#</cell><cell>feature</cell><cell>score</cell></row><row><cell>#1</cell><cell>n_tokens</cell><cell>0.121</cell></row><row><cell>#2</cell><cell>char_per_tok</cell><cell>0.098</cell></row><row><cell>#3</cell><cell>verbal_root_perc</cell><cell>0.095</cell></row><row><cell>#4</cell><cell>verbs_num_pers_dist_Plur+2</cell><cell>0.090</cell></row><row><cell>#5</cell><cell>verbs_num_pers_dist_Plur+1</cell><cell>0.088</cell></row><row><cell>#6</cell><cell>upos_dist_SYM</cell><cell>0.080</cell></row><row><cell>#7</cell><cell>n_sentences</cell><cell>0.077</cell></row><row><cell>#8</cell><cell>aux_tense_dist_Imp</cell><cell>0.077</cell></row><row><cell>#9</cell><cell>verbs_tense_dist_Imp</cell><cell>0.072</cell></row><row><cell>#10</cell><cell>aux_tense_dist_Pres</cell><cell>0.067</cell></row><row><cell>#11</cell><cell>verbal_head_per_sent</cell><cell>0.066</cell></row><row><cell>#12</cell><cell>dep_dist_conj</cell><cell>0.065</cell></row><row><cell>#13</cell><cell>tokens_per_sent</cell><cell>0.064</cell></row><row><cell>#14</cell><cell>verbs_form_dist_Fin</cell><cell>0.053</cell></row><row><cell>#15</cell><cell>n_prepositional_chains</cell><cell>0.052</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://linguistic-profiling.italianlp.it/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://fanlore.org/wiki/Reader-Insert impact of the author's popularity and productivity on the success of their fanfiction.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Hellekson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Busse</surname></persName>
		</author>
		<title level="m">Fan fiction and fan communities in the age of the internet: new essays</title>
				<imprint>
			<publisher>McFarland</publisher>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Success with style: Using writing style to predict the success of novels</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">G</forename><surname>Ashok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2013 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1753" to="1764" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Modeling and predicting literary reception. A data-rich approach to literary historical reception</title>
		<author>
			<persName><forename type="first">J</forename><surname>Brottrager</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Stahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Arslan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Brandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Weitin</surname></persName>
		</author>
		<idno type="DOI">10.48694/jcls.95</idno>
		<ptr target="https://doi.org/10.48694/jcls.95" />
	</analytic>
	<monogr>
		<title level="j">Journal of Computational Literary Studies</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Algee-Hewitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Allison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gemma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Heuser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Moretti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Walser</surname></persName>
		</author>
		<ptr target="https://litlab.stanford.edu/LiteraryLabPamphlet11.pdf" />
		<title level="m">Canon/archive: large-scale dynamics in the literary field</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<ptr target="https://api.semanticscholar.org/CorpusID:265096028" />
		<title level="m">Reviews matter: How distributed mentoring predicts lexical diversity on fanfiction</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Fan fiction and informal language learning</title>
		<author>
			<persName><forename type="first">S</forename><surname>Sauro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The handbook of informal language learning</title>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="139" to="151" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">How quantifying the shape of stories predicts their success</title>
		<author>
			<persName><forename type="first">O</forename><surname>Toubia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Berger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Eliashberg</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:235648521" />
	</analytic>
	<monogr>
		<title level="j">Proceedings of the National Academy of Sciences of the United States of America</title>
		<imprint>
			<biblScope unit="volume">118</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">What holds attention? linguistic drivers of engagement</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Berger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">W</forename><surname>Moe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Schweidel</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:255250393" />
	</analytic>
	<monogr>
		<title level="j">Journal of Marketing</title>
		<imprint>
			<biblScope unit="volume">87</biblScope>
			<biblScope unit="page" from="793" to="809" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A multi-task approach to predict likability of books</title>
		<author>
			<persName><forename type="first">S</forename><surname>Maharjan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Arevalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Montes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>González</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Solorio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 15th Conference of the European Chapter of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1217" to="1227" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Bizzoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">F</forename><surname>Moreira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">M S</forename><surname>Lassen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Thomsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Nielbo</surname></persName>
		</author>
		<title level="m">A matter of perspective: Building a multi-perspective annotated dataset for the study of literary quality</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M.-Y</forename><surname>Kan</surname></persName>
		</editor>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<ptr target="https://aclanthology.org/2024.lrec-main.71" />
		<title level="m">Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL</title>
				<editor>
			<persName><forename type="first">V</forename><surname>Hoste</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sakti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Xue</surname></persName>
		</editor>
		<meeting>the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL<address><addrLine>Torino, Italia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="789" to="800" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Big data meets storytelling: using machine learning to predict popular fanfiction</title>
		<author>
			<persName><forename type="first">D</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zigmond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Glassco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Giabbanelli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Social Network Analysis and Mining</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page">58</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Beyond canonical texts: A computational analysis of fanfiction</title>
		<author>
			<persName><forename type="first">S</forename><surname>Milli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bamman</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D16-1218</idno>
		<ptr target="https://aclanthology.org/D16-1218" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Su</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Duh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Carreras</surname></persName>
		</editor>
		<meeting>the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Austin, Texas</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2048" to="2053" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Quantitative analysis of fanfictions&apos; popularity</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Sourati Hassan Zadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sabri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chamani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bahrak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Social Network Analysis and Mining</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page">42</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">The style of a successful story: a computational study on the fanfiction genre</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mattei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Brunato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dell'Orletta</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-2769/paper_52.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventh Italian Conference on Computational Linguistics, CLiC-it 2020</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">J</forename><surname>Monti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Dell'Orletta</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Tamburini</surname></persName>
		</editor>
		<meeting>the Seventh Italian Conference on Computational Linguistics, CLiC-it 2020, March 1-3, 2021<address><addrLine>Bologna, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">2769</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Linguistic profiling for authorship recognition and verification</title>
		<author>
			<persName><forename type="first">H</forename><surname>Van Halteren</surname></persName>
		</author>
		<idno type="DOI">10.3115/1218955.1218981</idno>
		<ptr target="https://aclanthology.org/P04-1026" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)</title>
				<meeting>the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)<address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="199" to="206" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Profiling-UD: a tool for linguistic profiling of texts</title>
		<author>
			<persName><forename type="first">D</forename><surname>Brunato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cimino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dell'Orletta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Venturi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Montemagni</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2020.lrec-1.883" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Béchet</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Blache</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Choukri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Cieri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Declerck</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Goggi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Isahara</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Maegaard</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Mariani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Mazo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Moreno</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Odijk</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Piperidis</surname></persName>
		</editor>
		<meeting>the Twelfth Language Resources and Evaluation Conference, European Language Resources Association<address><addrLine>Marseille, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="7145" to="7151" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Intelligible models for classification and regression</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Lou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Caruana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gehrke</surname></persName>
		</author>
		<idno type="DOI">10.1145/2339530.2339556</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</title>
				<meeting>the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
