<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Multilingual Stylometry: The Influence of Language on the Performance of Authorship Attribution using Corpora from the European Literary Text Collection (ELTeC)</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Christof</forename><surname>Schöch</surname></persName>
							<email>schoech@uni-trier.de</email>
							<affiliation key="aff0">
								<orgName type="department">Trier Center for Digital Humanities</orgName>
								<orgName type="institution">Trier University</orgName>
								<address>
									<settlement>Trier</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Julia</forename><surname>Dudar</surname></persName>
							<email>dudar@uni-trier.de</email>
							<affiliation key="aff0">
								<orgName type="department">Trier Center for Digital Humanities</orgName>
								<orgName type="institution">Trier University</orgName>
								<address>
									<settlement>Trier</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Evgeniia</forename><surname>Fileva</surname></persName>
							<email>fileva@uni-trier.de</email>
							<affiliation key="aff0">
								<orgName type="department">Trier Center for Digital Humanities</orgName>
								<orgName type="institution">Trier University</orgName>
								<address>
									<settlement>Trier</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Artjoms</forename><surname>Šeļa</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Institute of Polish Language</orgName>
								<orgName type="institution">Polish Academy of Sciences</orgName>
								<address>
									<settlement>Kraków</settlement>
									<country key="PL">Poland</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Institute of Czech Literature</orgName>
								<orgName type="institution">Czech Academy of Sciences</orgName>
								<address>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Multilingual Stylometry: The Influence of Language on the Performance of Authorship Attribution using Corpora from the European Literary Text Collection (ELTeC)</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">AD8F154E4B71A3790B0F54BCE538BE22</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:48+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Stylometry</term>
					<term>Authorship Attribution</term>
					<term>Multilingualism</term>
					<term>ELTeC</term>
					<term>Translation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Stylometric authorship attribution is concerned with the task of assigning texts of unknown, pseudonymous or disputed authorship to their most likely author, often based on a comparison of the frequency of a selected set of features that represent the texts. The parameters of the analysis, such as feature selection and the choice of similarity measure or classification algorithm, have received significant attention in the past. Two additional key factors for the performance and reliability of stylometric methods, however, have so far received less attention, namely corpus composition and corpus language. As a first step, the aim of this study is to investigate the influence of language on the performance of stylometric authorship attribution. We address this question using four different corpora derived from the European Literary Text Collection (ELTeC). We use machine translation to obtain each corpus in the other three languages. We find that, as expected, the attribution accuracy varies between language-based corpora, and that translated corpora, on average, display a lower attribution accuracy compared to their counterparts in the original language. Overall, our study contributes to a better understanding of stylometric methods of authorship attribution.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Stylometric authorship attribution is the task of assigning texts of unknown, pseudonymous or disputed authorship to their most likely author, often based on a comparison of the frequency of a selected set of features that represent texts <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b25">26,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b22">23]</ref>.</p><p>Traditionally, the way stylometric methods approach authorship attribution is to use the frequencies of a large number of simple features, such as words, lemmas or character sequences, to determine the degree of similarity between texts. These similarities, in turn, are interpreted as an indicator of the likelihood for two texts to have been written by the same author: the more similar the feature vectors, the more likely identical authorship is. The parameters of the analysis, such as feature selection and the choice of similarity measure or classification algorithm, have received significant attention in the past (e.g. <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b26">27,</ref><ref type="bibr" target="#b10">11]</ref>). Two additional key factors for the performance and reliability of stylometric methods, however, are corpus composition and corpus language. They are relevant not only for the results in a specific case, but also for the overall performance and reliability of stylometric methods of authorship attribution.</p><p>Therefore, the aim of ongoing research by our group is to disentangle the influence of corpus composition and language on the performance of stylometric authorship attribution: To what extent do the attribution accuracy and robustness of such approaches depend on the language of the materials? 
To what extent do they depend on the corpus composition, which can present more or less challenging constellations of authorship and style? How do these two factors interact with each other, and how do they interact with feature selection?</p><p>This paper tackles one part of this issue, that of language, by investigating four distinct but broadly comparable corpora in a classification scenario. The corpora are all derived from the European Literary Text Collection (ELTeC) and contain at least three novels each by 8-10 different undisputed authors. These corpora can therefore be used as benchmarking datasets, where the true authorship is known for all novels under investigation. Specifically, corpora in English, French, Hungarian and Ukrainian are included. While the corpora contain similar texts (fictional narrative prose from a similar time period, i.e., 1840-1920), they differ both in terms of their exact composition, which can affect overall difficulty of attribution and therefore attribution performance, and in terms of their language, which can likewise affect attribution performance.</p><p>In order to investigate the role of language independently of corpus composition, all four corpora were automatically translated into the other three languages using the DeepL machine translation system (Pro version, used in the time period July to September 2023). 
This allows us to vary language while keeping corpus composition stable, and in this manner, tease out effects of both in terms of attribution performance (measured as classification accuracy).</p><p>In addition, and in order not to disadvantage any one corpus or language by selecting features that may not be suitable to them, a number of further parameters have been varied, namely: the type of feature considered (word forms, lemmas, part of speech or characters); the length of the feature sequence considered as the unit of analysis (unigrams, bigrams, 3-grams, 4-grams or 5-grams); the total number of different features considered (from 50 to 2000 in several increments); and the length of contiguous textual segments considered (from 5000 words to 10000 words as well as using entire novels).</p><p>Testing all of these various parameters, corpora and languages in all of their possible combinations results in a large number of individual results. In order to make them accessible to inspection and analysis, we have developed an interactive visualization that displays the attribution quality (i.e. the classification accuracy) in a heatmap as a function of the parameters described above. While the heatmap displays the accuracy for the entire range of features and segment lengths at the same time, other parameters can be selected by the users in order for the heatmap to be updated with the corresponding accuracy values. Two such heatmaps are provided next to each other to enable convenient comparison of results between any two configurations of parameters, whether corpora, languages or other settings.</p><p>The present paper can be understood as a background publication to these heatmaps published as a publicly available, interactive online showcase that is available at https://showcases.clsinfra.io/stylometry. 
The target audience of this showcase is scholars and students interested in a key methodological aspect of stylometry-based authorship attribution as well as those interested in cross-lingual approaches within the digital humanities. The showcase is intended both as a convenient way to document a specific research output and as a pedagogical tool that could be used in workshop and classroom settings alike. Readers are encouraged to use the showcase online, where much more detailed results can be explored, beyond the high-level summary of results we present here.</p><p>The detailed results are discussed below, and contribute to our general understanding of the influence of language and corpus composition on stylometric results. Broadly speaking, we can show that corpora of different language and composition lead to different attribution accuracy levels and different best-performing features, an expected result. We can also show that translated corpora (at least when all texts have been translated by the same machine translation system) usually lead to a lower attribution accuracy, overall, compared to their counterpart in the original language.</p><p>In the following, we first discuss relevant prior work evaluating stylometric performance in the context of multilingual corpora (section 2). We then outline our research questions and objectives (section 3), before describing our data and methods (section 4 and section 5). We discuss the main outcomes with respect to attribution performance, detailing corpus-based and language-based variations in results (section 6). By way of a conclusion, we discuss the consequences of our findings, document some limitations of this research and outline ongoing research into the second aspect of this investigation, namely corpus composition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Relevant prior work</head><p>There has been a significant body of work investigating the methodological and theoretical foundations of stylometric authorship attribution. Generally speaking, it seems fair to say that most attention has been devoted to feature selection and the choice of distance measures or classification algorithms. While much research is focused on English-language materials, a limited amount of attention has been given to the question of variation across languages. Only rarely, however, has the influence of corpus composition on attribution performance been assessed beyond general statements about the advantages of generically homogeneous corpora. In the following, we discuss some of this research in more detail.</p><p>With respect to investigations of feature selection, John Burrows <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b3">4]</ref> proposes multiple measures, each focused on a different part of the frequency spectrum: Delta, Iota and Zeta. Smith and Aldridge <ref type="bibr" target="#b24">[25]</ref> use word forms to determine the best settings for the number of most frequent words and the feature vector onset, finding that "a word frequency vector of between 200 and 300 words gives the most accurate results" when using Burrows' Delta. They also show that shifting the feature vector onset, effectively skipping the most frequent features, is detrimental to attribution accuracy.</p><p>In addition to varying the number of most frequent words, examples of using features other than words can also be found in the research literature. In particular, Stamatatos <ref type="bibr" target="#b26">[27]</ref> focuses on the effectiveness of using character n-grams. He has conducted tests of 3-grams and frequent words on texts of different genres and topics. 
Stamatatos sees it as an advantage of character n-grams that they capture linguistic patterns and may therefore be suitable to represent an author's style beyond word choice. He has obtained the best results with a "model based on character 3-grams in combination with an SVM classifier with linear kernel", and suggests using this as a baseline for authorship attribution. Building on the conventional use of word frequencies, Halteren et al. <ref type="bibr" target="#b12">[13]</ref> investigate the application of syntactic features in stylometry. Their research highlights how syntactic patterns, based on word class tags, complement lexical features in finding what they call the "human stylome", i.e. individual measurable linguistic features that help distinguish between authors. The study shows that although 100% accuracy could not be achieved, lexical features perform slightly more efficiently than syntactic features and that in general, a greater number and/or variety of features improves the results. In a similar vein, Gorman <ref type="bibr" target="#b11">[12]</ref> uses syntactic information for the attribution of short texts.</p><p>With respect to distance or similarity measures, Argamon <ref type="bibr" target="#b0">[1]</ref> proposes a geometric interpretation of Burrows' Delta and derives several alternatives to this particular distance measure, among them Quadratic Delta. Smith and Aldridge <ref type="bibr" target="#b24">[25]</ref> propose to move beyond distance-based comparisons and suggest the adoption of the Cosine similarity instead, where the directions of the vectors are compared without being influenced by their length. Evert et al. <ref type="bibr" target="#b10">[11]</ref> follow this important new direction and propose Cosine Delta, which combines z-score normalization with the Cosine measure. 
They have confirmed the finding by Smith and Aldridge <ref type="bibr" target="#b24">[25]</ref>, namely that performance is more stable for longer feature vectors when Cosine Delta is used. In addition, they show that this effect holds across several languages, at least for long fiction narratives (novels). In several cases, interactions between various factors (in particular between feature vector length, distance measure and language) have been observed.</p><p>With respect to languages, many methodological investigations focus on English corpora only, as is the case with Burrows <ref type="bibr" target="#b4">[5]</ref> or Smith and Aldridge <ref type="bibr" target="#b24">[25]</ref>. Rybicki and Eder <ref type="bibr" target="#b18">[19]</ref>, however, show that results vary widely as a function of language when investigating feature vector length and onset for multiple corpora in different languages, covering various literary genres. Evert et al. <ref type="bibr" target="#b10">[11]</ref> test all their hypotheses on English, German and French corpora consisting of novels, with only minor differences between languages. However, the extent to which the differences in performance can be explained by language, on the one hand, and by the inevitable differences between corpora with respect to their genre and/or composition, on the other, is not within the scope of their research.</p><p>Stylometric research into translated works does exist, but is relatively rare. Rybicki and Heydel <ref type="bibr" target="#b21">[22]</ref> show that in the case where several different translators translate the works of a given author, their stylistic signature can be detected by stylometric methods. In other cases, especially in corpora involving multiple authors and multiple translators, the authorial signal appears to be stronger than the translator's signal <ref type="bibr" target="#b20">[21]</ref>. 
In a recent study, Rybicki <ref type="bibr" target="#b19">[20]</ref> focuses on network analysis, using an extensive collection of texts from different genres and languages translated into Polish. He observes that the translated and original texts reveal stylistic similarities, from which it can be concluded that the translations are an extension of native literary traditions. Rybicki also notes that the distinctiveness of genres becomes visible during stylometric analysis.</p><p>There are also several relevant studies that use machine-translated texts. This approach underlies Cross Language Authorship Attribution, a concept introduced by Bogdanova and Lazaridou <ref type="bibr" target="#b1">[2]</ref>. Although the quality of automatically translated texts may distort an author's style, translation is one way of bringing texts written in different languages "into one space". Machine translation in this study is combined with a focus on lexical and higher-level features. The best results were achieved by combining machine translation and k-Nearest Neighbors. Based on this study, Mikros and Boumparis <ref type="bibr" target="#b16">[17]</ref> apply the machine translation method for the language pair Greek-English. They conclude that when trained and tested on machine-translated texts, attribution accuracy is not strongly affected, compared to training and testing on original texts, but that the relevant features differ between the two languages, making cross-linguistic authorship attribution unreliable.</p><p>As mentioned above, the influence of corpus composition on attribution performance has so far not been investigated in any considerable detail. Smith and Aldridge <ref type="bibr" target="#b24">[25]</ref> observe that a chronologically more diverse corpus leads to lower attribution accuracy but do not go into much detail. Kestemont et al. <ref type="bibr" target="#b14">[15]</ref> show that cross-genre attribution is more challenging than within-genre attribution. 
A challenge for this kind of investigation is the availability of meaningful, high-quality metadata to assess corpus composition.</p><p>Based on this review of relevant research, we can conclude that we should expect some variation in attribution accuracy depending on language. Also, there is a clear need to investigate the role of corpus composition for the accuracy and robustness of stylometric authorship attribution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Research questions and objectives</head><p>The overarching research question that ongoing work of our team addresses is to what extent, how, and under what conditions, both language and corpus influence the performance of stylometric methods of authorship attribution. The objective of our research, therefore, is to provide detailed data evaluating the accuracy and robustness of stylometric methods for corpora of different compositions and in different languages. As a first step, the present paper aims to determine how the results of stylometry are influenced by the language of the texts, when the corpus composition remains the same, that is, using corpora translated into multiple languages. How does accuracy vary across languages? How does it change when a given corpus is translated into a different language? What kind of interaction can we observe between feature selection and language, as well as between sample size and language?</p><p>In a second step, not covered by the present paper, we aim to determine how corpus composition affects attribution accuracy. In order to address this question, a formalization of corpus composition needs to be designed that quantifies the level of difficulty, for an attribution task, of a given corpus. How does such a measure need to be designed in order to adequately represent corpus difficulty in terms of authorship attribution? What types of metadata, for example regarding subgenre, time period or formal aspects, are most strongly correlated to differences in attribution accuracy? Does a higher corpus difficulty as measured based on metadata correlate with lower attribution accuracy? What is the effect of translation on this relationship? Our preliminary work on this issue is discussed in subsection 7.3 of this paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Data</head><p>The dataset was derived from the European Literary Text Collection (ELTeC; see <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b2">3]</ref>) in order to build corpora in four languages: French, English, Hungarian, and Ukrainian. The choice of languages is determined by our desire to cover a number of different language groups: Romance, Germanic, Finno-Ugric, and Slavic languages. In this way, one can illustrate the effectiveness of the stylometric method and see how language affects the analysis results across the four language groups.</p><p>The original ELTeC corpora each include one hundred novels. For our research, a selection of novels was made, retaining novels by 8 to 10 authors, each represented with three novels from ELTeC. This resulted in the following collections of texts: the English and the French corpora each include 30 novels, the Hungarian corpus 27 novels, and the Ukrainian corpus 24 novels (see key information on the corpora in Table <ref type="table" target="#tab_0">1</ref>). Each of the corpora was translated into the other three languages, thus the entire dataset comprises 16 sub-corpora. The dataset also includes a metadata table for each corpus, with information about each of the novels. The metadata table includes information about the author, year of publication of digital and print editions, language, number of words in the novel, as well as subgenre (social, historical, adventure, detective or sentimental novel, bildungsroman or other) and narrative perspective of the novel (heterodiegetic, homodiegetic, epistolary, dialogue or mixed). This metadata was collected from experts in each of the relevant literary traditions who based their information on a reading of both the novels and of relevant secondary literature.</p><p>Data preparation involved two stages: translation of texts into the other languages and their linguistic annotation. 
The texts were translated automatically using DeepL Pro. For subsequent analysis, extracting lemmas and POS (part of speech) tags was necessary, a task we accomplished using the SpaCy library (version 3.7) for both original and translated texts. Additionally, unigrams and n-grams (from 2 to 5) were extracted using the stylo package <ref type="bibr" target="#b9">[10]</ref>.</p></div>
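The n-gram extraction step described above was performed with the stylo package; the same operation can be sketched in plain Python (a minimal illustration with hypothetical function names, not stylo's actual API):

```python
def char_ngrams(text, n):
    """All contiguous character n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n):
    """All contiguous word n-grams, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# e.g. char_ngrams("style", 3) yields ["sty", "tyl", "yle"]
```

The same functions cover lemma and POS n-grams as well, since those are simply token sequences after annotation.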
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Method(s)</head><p>Fundamentally, we used an authorship attribution classification task with leave-one-out cross-validation. Inspired by Rybicki and Eder <ref type="bibr" target="#b18">[19]</ref>, we used grid search over the feature space to assess the general performance of authorship attribution and quantified the attribution performance in terms of its accuracy (mean attribution accuracy score for each condition as well as Cohen's Kappa for each condition). The goal was not to optimize for performance, but to cover the space of reasonable approaches that work for different languages and to understand the nature of differences in performance across languages and corpora.</p></div>
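The leave-one-out protocol can be sketched as follows. The study itself used an SVM classifier; the `nearest_neighbour` rule below is only a toy stand-in so the sketch is self-contained:

```python
def loo_accuracy(samples, labels, classify):
    """Leave-one-out cross-validation: hold out each text in turn,
    train on the rest, predict the held-out author, score accuracy."""
    hits = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classify(train_x, train_y, samples[i]) == labels[i]:
            hits += 1
    return hits / len(samples)

def nearest_neighbour(train_x, train_y, test_x):
    """Toy classifier: return the label of the closest training sample."""
    best = min(range(len(train_x)), key=lambda j: abs(train_x[j] - test_x))
    return train_y[best]
```

Grid search then simply repeats `loo_accuracy` for every combination of feature type, n-gram size, vector length and sample size.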
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Features</head><p>While in many scenarios, simple word form frequencies still prove effective, other kinds of features can in some settings be advantageous, for example for short texts or agglutinative languages <ref type="bibr" target="#b17">[18]</ref>. Therefore, we deliberately varied the feature types to some extent, in order not to miss conditions that work well in one or the other of the less frequently investigated languages. There are three main levels of variation: types of features, sample size and feature vector length.</p><p>In terms of types of features, we used frequencies of word forms, lemmas, part-of-speech (POS) tags and characters. Each feature was cut to n-grams of different size: 1-3 for words and lemmas, 2-5 for character and POS n-grams (because the number of POS and character unigrams is too limited to be useful). Each n-gram length was tested independently.</p><p>With respect to sample size, frequency-based approaches to authorship attribution naturally depend on the available size of the text. There is a considerable variation in text sizes in ELTeC (as shown in Table <ref type="table" target="#tab_0">1</ref>); to mitigate this, we draw a random sample of consecutive tokens (a "chunk") for each text based on the shortest text across all corpora (10,000 words). We use sample sizes of 5,000 to 10,000 tokens for word-based features, and 10,000 to 50,000 for character-based ones. The number of available tokens per text differs dramatically between words and characters, so sample sizes mean different things for n-grams based on these features. To account for the variability that is introduced by taking only one limited sample out of sometimes very large novels, at each step we take a random consecutive sample out of all available 'chunks' and record performance 100 times. 
As the last step, we perform classification on full-length texts.</p><p>In terms of feature vector length, the vectors that represent texts are constructed by cutting off the ranked list of most frequent features at a given length. We used 50 to 2000 features with incrementally increasing step sizes.</p></div>
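The chunk sampling and feature-vector construction just described can be sketched like this (a minimal illustration under stated assumptions, not the paper's actual pipeline):

```python
import random
from collections import Counter

def random_chunk(tokens, size, rng=random):
    """Draw one random contiguous sample ("chunk") of `size` tokens."""
    start = rng.randrange(len(tokens) - size + 1)
    return tokens[start:start + size]

def most_frequent_features(docs, n_mff):
    """Rank features by total corpus frequency; keep the top n_mff (MFF cutoff)."""
    totals = Counter()
    for doc in docs:
        totals.update(doc)
    return [feat for feat, _ in totals.most_common(n_mff)]

def relative_freq_vector(doc, features):
    """Represent one text as relative frequencies of the selected features."""
    counts = Counter(doc)
    return [counts[f] / len(doc) for f in features]
```

Repeating `random_chunk` 100 times per text and re-running the classification yields the distribution over samples described above.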
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Classification task</head><p>For author-based text classification, we use the Support Vector Machine classifier and perform leave-one-out cross-validation. This means that in each step, one text is removed from the data, the model is then trained on the remaining texts, and then the authorship of the left-out text is predicted and the result is recorded. This process continues until all texts have been left out once. For each language combination, we run 100 iterations of leave-one-out cross-validation, taking a sample from each text (chunks of consecutive sequences of size n) at random. Additionally, we run a single full-text analysis for each corpus.</p><p>We report two performance measures: simple accuracy (proportion of correctly predicted authors) and Cohen's kappa (typically used for measuring inter-annotator agreement <ref type="bibr" target="#b6">[7]</ref>). The latter measure, while highly correlated with accuracy, allows us to partly offset the different number of classes across corpora, since it provides an inter-rater agreement score adjusted for random classification.</p></div>
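Cohen's kappa corrects raw accuracy for the agreement expected by chance given the marginal label distributions, which is why it partly offsets differing numbers of authors per corpus. A minimal sketch (assumes more than one class, so the denominator is nonzero):

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - expected) / (1 - expected), where
    `expected` is the chance agreement implied by the two marginal
    label distributions."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_c, pred_c = Counter(y_true), Counter(y_pred)
    expected = sum(true_c[c] * pred_c[c] for c in true_c) / (n * n)
    return (observed - expected) / (1 - expected)
```

With fewer classes, random guessing yields higher raw accuracy, which kappa discounts; that is the adjustment the paper relies on when comparing corpora with 8 versus 10 authors.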
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Visualization</head><p>To create the interactive visualizations of the showcase, we used Bokeh, a Python library for interactive and dynamic data visualizations. The visualization, which is available online at https://showcases.clsinfra.io/stylometry, consists of two heatmaps, enabling users to compare and contrast two sets of results simultaneously. In this article, we focus on the results for the Hungarian corpus, which are presented in subsection 6.2. For results from other corpora, please see section 8 (appendix) and/or visit the online showcase.</p><p>The x-axis represents various settings of the most frequent features (MFF) used in a specific analysis, while the y-axis denotes distinct sample sizes, ranging from shorter text snippets to entire novels ("full novel"). Words and characters are treated differently and analyzed across diverse sample sizes (see subsection 5.1 above). Each heatmap cell corresponds to a combination of MFF and sample size, with the color intensity indicating the accuracy obtained in the analysis. A mouseover provides further information, such as the features used as well as numerical indications of accuracy and Cohen's Kappa. When there is no data corresponding to selected feature combinations, the plots display a "No data available for these selectors" message. The range of MFF values depends on the features; in some cases, as for character and POS bigrams, the theoretical maximum number of features is limited.</p><p>Users can engage with the data through several selectors, enhancing the exploration and analysis process. The available choices include:</p><p>1. Corpus selector: Enables the selection from a range of available corpora, facilitating comparative studies; 2. Feature level selector: Permits users to toggle between "words" and "chars" (characters) feature types; 3. 
Feature type selector: Allows for the choice between plain text, lemmas, or POS tags; 4. Ngram size selector: Offers options to choose n-gram sizes from 1 to 5, allowing for a detailed examination of linguistic patterns.</p><p>A color scale on the right side of each graph delineates the accuracy levels, ranging from 0 to 1. Here, cooler tones like greens, blues and purples signify lower values, whereas warmer hues, like red and orange, indicate higher values. This color coding serves as a primary indicator of stylometric effectiveness across various parameter settings, vividly illustrating the success rate of stylometric analysis as a function of the chosen corpus and the selected criteria.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Results</head><p>In this section, we present a concise overview of our key findings. First, we focus on evaluating the overall performance of authorship attribution, based on mean accuracy scores across multiple settings, for all corpora. Second, we provide a detailed description of the main outcomes for the Hungarian corpus and its translations. We focus on one language for reasons of space, but readers can find a detailed description of the performance of the other three languages in the appendix (section 8) to this paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Attribution performance across all corpora</head><p>Regarding overall performance, we discovered several parameters that consistently yield the best results across all corpora and translations. At the word level, the following parameters showed the highest scores: unigram for both feature types (plain word forms and lemmas), with only minimal differences. Meanwhile, at the character level, very high performance was observed using 5-grams based on both plain word forms and lemmas as feature types and using full novels as sample size.</p><p>In Figure <ref type="figure" target="#fig_0">1</ref> and Figure <ref type="figure" target="#fig_1">2</ref>, the average classification accuracy at the word level is illustrated, employing unigram plain word forms, with sample sizes ranging from 7,000 to 9,000 and from 10,000 to full novel size, respectively. As observed in the figures, the clear winner among the original texts is the Hungarian corpus. Translations from Hungarian into the three other languages also demonstrate very high performance, decreasing only from 0.86 for the original texts to 0.6-0.7 when full texts and a sample size of 10,000 are taken into account. For smaller sample sizes, the decrease is more noticeable, with accuracy falling to 0.4-0.5, while the accuracy of the original texts is 0.8. The second-best performance comes from the original Ukrainian corpus; however, translations from Ukrainian have low performance, falling to 0.5 and 0.4 from the original scores of 0.76 and 0.65 for large and smaller sample sizes respectively, indicating that translations from Ukrainian are particularly challenging for attribution tasks. The best performance among translations from Ukrainian, surprisingly, is shown by the translations into Hungarian.</p><p>The French and English corpora show very similar performance when the original texts are analyzed. 
However, translations from English perform worse at the word level than the original texts, and the difference is more pronounced for smaller sample sizes, falling from 0.5 to 0.3-0.4. At the character level, on the other hand, translations into Hungarian outperform the original texts when 5-grams based on both lemmas and plain word forms are used.</p><p>Remarkably, the smallest difference in performance between translated and original texts is observed for the French corpus, where the difference is below 0.1 for large sample sizes and only slightly above 0.1 for smaller sample sizes. The authors' style thus appears to be best preserved in translations from French. All three translations from French perform similarly; however, translations into English slightly outperform the others.</p></div>
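The word-level setting discussed above (relative frequencies of the N most frequent word unigrams as features) can be sketched as follows. This is a minimal, illustrative implementation with a simple nearest-neighbor decision rule and hypothetical toy data; it is not the classification pipeline used in the study.

```python
from collections import Counter

def build_vocab(texts: list[str], n_mfw: int) -> list[str]:
    """The n_mfw most frequent word unigrams across the whole corpus."""
    counts = Counter()
    for t in texts:
        counts.update(t.lower().split())
    return [w for w, _ in counts.most_common(n_mfw)]

def mfw_features(text: str, vocab: list[str]) -> list[float]:
    """Relative frequency of each vocabulary word in one text sample."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in vocab]

def attribute(train: list[tuple[str, str]], sample: str, n_mfw: int = 100) -> str:
    """Assign `sample` to the author of the nearest training text
    in MFW-frequency space (Manhattan distance, for illustration)."""
    vocab = build_vocab([t for _, t in train] + [sample], n_mfw)
    target = mfw_features(sample, vocab)

    def dist(author_text: str) -> float:
        vec = mfw_features(author_text, vocab)
        return sum(abs(a - b) for a, b in zip(vec, target))

    return min(train, key=lambda pair: dist(pair[1]))[0]
```

In practice the feature matrix would be built from full novels or fixed-size samples, with N (the number of most frequent features) varied systematically, as in the experiments reported here.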
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Hungarian corpus attribution performance</head><p>As discussed above, the Hungarian corpus (Figure <ref type="figure" target="#fig_2">3</ref>) achieves the highest accuracy scores overall, reaching approximately 80-95% across many sample sizes; from samples of 8,000 words onwards, results are reliably very accurate. These scores are obtained with the following settings: the number (N) of most frequent words ranges from 50 to 700, with the analysis based on unigram plain word forms. Classification based on unigram lemmas is slightly less effective but still produces high scores. When character n-grams are used as features, performance remains remarkably high, fluctuating between 70-93% for sample sizes of 4,000 words and above, with the best results obtained for character 5-grams on full novels (Figure <ref type="figure" target="#fig_3">4</ref>). In this scenario, attribution based on character n-grams derived from word forms, with the number of MFF ranging from 50 to 500, maintains a constant accuracy rate of over 90% for full novels.</p><p>When evaluating the accuracy of translations from Hungarian into other languages, it becomes evident that performance is noticeably lower than for the original novels. One of the few configurations with high accuracy (over 90%) is the translation into Ukrainian analyzed with character 5-grams based on plain word forms. For lemmas, the Ukrainian translation in some cases even outperforms the original texts (Figure <ref type="figure" target="#fig_4">5</ref>). At the word level, high accuracy is observed only for the full novel sample size. Translations into English and Ukrainian perform very similarly, whereas the French translation shows lower accuracy.</p></div>
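The character-level feature type behind the strongest results reported here, overlapping character 5-grams (computed either over plain word forms or over lemmatized text), can be illustrated as follows. The function names and the top_k cutoff are our own choices for this sketch, not taken from the study's setup.

```python
from collections import Counter

def char_ngrams(text: str, n: int = 5) -> Counter:
    """Counts of all overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_profile(text: str, n: int = 5, top_k: int = 500) -> dict[str, float]:
    """Relative frequencies of the top_k most frequent character n-grams.

    These relative frequencies play the same role as MFW frequencies
    in the word-level setting and feed the same classifiers.
    """
    counts = char_ngrams(text, n)
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.most_common(top_k)}
```

Because 5-grams capture sub-word morphology and frequent function-word sequences, they tend to need larger samples (here: full novels) before the profiles stabilize.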
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>In conclusion, we summarize some of the key findings of this study and discuss a number of limitations and next steps for research in this area.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">Findings</head><p>Consistent patterns emerge across all languages and translations. These trends hold largely regardless of whether the analysis is based on words, lemmas or character n-grams.</p><p>Optimal performance consistently emerges when the analysis is based on entire novels rather than shorter samples. When samples are used, results tend to improve as sample size increases. This finding is consistent with previous studies <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>.</p><p>Conversely, increasing the number of features used for attribution beyond a certain value tends to deteriorate classification scores in most cases. The decline in performance becomes noticeable for full novels and word unigrams beyond roughly 300-800 most frequent words for French, English, and Hungarian (but not for Ukrainian). Furthermore, the effect is considerably less pronounced for word bigrams and trigrams. This finding is consistent with results from distance-based attribution studies using Burrows' Delta, but not with those using Cosine or Cosine Delta <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b10">11]</ref>. This phenomenon may be caused either by an increasing proportion of features driven by topic rather than authorship, or by an increased amount of noise in the signal when higher numbers of features are used. Corpora always perform better when analyzed in their original language rather than in a translation into another language. However, this might be because all translations were, in effect, produced by a single translator, which is a major limitation of our study (see below).</p></div>
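The two distance measures contrasted above can be sketched from their standard definitions: Burrows' Delta is the mean absolute difference of z-scored feature frequencies, while Cosine Delta uses the cosine distance between the same z-score vectors. This is a minimal sketch of the textbook formulas, not the study's implementation.

```python
import math
from statistics import mean, pstdev

def z_scores(freqs: list[list[float]]) -> list[list[float]]:
    """Standardize each feature (column) across all texts in the corpus."""
    cols = list(zip(*freqs))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against zero variance
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in freqs]

def burrows_delta(za: list[float], zb: list[float]) -> float:
    """Burrows' Delta: mean absolute difference of z-scored frequencies."""
    return mean(abs(a - b) for a, b in zip(za, zb))

def cosine_delta(za: list[float], zb: list[float]) -> float:
    """Cosine Delta: cosine distance between z-score vectors."""
    dot = sum(a * b for a, b in zip(za, zb))
    na = math.sqrt(sum(a * a for a in za))
    nb = math.sqrt(sum(b * b for b in zb))
    return 1.0 - dot / (na * nb)
```

Because Cosine Delta normalizes vector length, adding many low-information features dilutes each coordinate but changes the angle between authorial profiles much less, which is one way to understand why it degrades less than Burrows' Delta at high feature counts.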
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">Limitations</head><p>The finding mentioned above that, in our experiments, the translated corpora always perform less well than the original corpora may well be an artefact of our method of translating these texts. Given that we used DeepL for all translations, the translations can be said to have all been produced by the same 'agent', leading to a certain homogenization of their style and lexicon, which we would expect to be a challenge for stylometric authorship attribution. In contrast, translations of literary texts are usually produced by a range of different translators, some of whom maintain privileged relationships with certain authors, in a constellation where authorial and translatorial signal may reinforce each other. At the same time, the question of the (in)visibility of translators in stylometry has been addressed several times <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b15">16]</ref>.</p><p>Another limitation with respect to the corpora is that for each language, only one particular corpus is available to us. This means that, given how important the genre of a corpus as well as the corpus composition are likely to be, we cannot generalize our results from these particular corpora to the languages in general.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3.">Next steps</head><p>As mentioned above, the research presented here covers only one of two aspects that we consider relevant to our overall research question. Our corpora differ from each other both in language and in composition. Therefore, the investigation of the influence of language and translation described here needs to be complemented by an analogous investigation into the influence of corpus composition on the results.</p><p>We have already made significant progress in this investigation, notably by collecting relevant metadata about all four corpora and by computationally modeling the relationship between corpus composition and attribution performance. With respect to the metadata, we have collected information about the year of publication, the narrative form and the subgenre of each novel in each corpus. To enable cross-corpus application of the subgenre labels, they come from a small, closed set of eight relatively broad categories, including one category called 'other'. With respect to the computational model of the relationship between corpus composition and attribution accuracy, we have fitted a Bayesian generalized linear model with multiple predictors that takes into account within-author and between-author variability and allows us to assess the positive or negative influence of metadata differences on textual distinctiveness (and, therefore, on stylometric performance). However, further work on this issue is needed, and presenting it warrants a separate publication.</p><p>Apart from this, our research opens up several avenues for future work that could close some of the gaps described in the section on limitations.
For instance, the same kind of analysis could be performed using several corpora of different genres, sizes and compositions for each language, in order to assess the degree of within-language variation and the robustness of the between-language differences observed in this study. In addition, a study could be conceived that varies corpus composition deliberately and systematically, based on a significantly larger corpus with document-level metadata (at least) on subgenre and narrative perspective.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Data and code</head><p>Importantly, the Hungarian translation of the English corpus (Figure <ref type="figure" target="#fig_11">13</ref>) approximates the performance of the original novels, and in some cases even outperforms them. The highest accuracies, up to 96% (lemma-based analysis) and 90% (word-form-based analysis), are achieved for the full novel sample size at the word level. However, performance decreases rapidly as the number of features increases and sample sizes decrease.</p><p>The translation into Ukrainian (Figure <ref type="figure" target="#fig_12">14</ref>) also performs well, reaching up to 86% for full novels, although its average accuracy is lower than that of the original corpus and the translation into Hungarian (Figure <ref type="figure" target="#fig_11">13</ref>). The translation into French, on the other hand, achieves its best performance of around 84% for lemma-based analysis, with most cases hovering around 20-30% (Figure <ref type="figure" target="#fig_4">15</ref>). Regarding the analysis of translations at the character level, the Hungarian translation outperforms the original corpus in the full novel sample size setting, while the French translation exhibits lower accuracy even for full novels, reaching a score of 80% only for the 1,500 and 2,000 feature settings in word-form and lemma-based analysis.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Mean accuracy across each corpus with the following parameters: word forms, word level, ngram = 1, sample size = 7000, 8000, 9000. Number of most frequent features: 100-1000.</figDesc><graphic coords="10,169.18,103.19,265.76,221.47" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Mean accuracy across each corpus with the following parameters: word forms, word level, ngram = 1, sample size = 10000 and full novels. Number of most frequent features: 100-1000.</figDesc><graphic coords="10,169.18,403.98,265.76,221.47" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Mean accuracy scores for the Hungarian corpus in the original language, for different settings of MFF and sample size. Parameters: Plain word form unigrams.</figDesc><graphic coords="12,109.24,152.46,387.85,158.07" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Mean accuracy scores for the Hungarian corpus in the original language, for different settings of MFF and sample size. Parameters: Character 5-grams of lemmas.</figDesc><graphic coords="13,109.24,151.07,387.85,160.84" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Mean accuracy scores for the Hungarian corpus translated into Ukrainian, for different settings of MFF and sample size Parameters: Character 5-grams of lemmas.</figDesc><graphic coords="14,109.24,151.87,387.85,159.24" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Ukrainian corpus, results for word form unigrams.</figDesc><graphic coords="17,109.24,152.20,387.85,158.59" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: English translations of Ukrainian texts, performance for word form unigram.</figDesc><graphic coords="19,109.24,152.66,387.85,157.67" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 9 :</head><label>9</label><figDesc>Figure 9: Results of French original novels, based on lemma unigrams.</figDesc><graphic coords="20,109.24,167.53,387.85,157.67" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 10 :</head><label>10</label><figDesc>Figure 10: Results of French original novels, based on character 5-grams.</figDesc><graphic coords="20,109.24,468.32,387.85,157.67" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 11 :</head><label>11</label><figDesc>Figure 11: Results of the English original corpus, based on lemma unigrams.</figDesc><graphic coords="21,109.24,166.59,387.85,159.56" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>Figure 12 :</head><label>12</label><figDesc>Figure 12: Results of the English original corpus, based on character 5-grams.</figDesc><graphic coords="21,109.24,468.29,387.85,157.74" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_11"><head>Figure 13 :</head><label>13</label><figDesc>Figure 13: Performance of translations from English into Hungarian, lemma unigrams.</figDesc><graphic coords="22,109.24,166.82,387.85,159.10" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_12"><head>Figure 14 :</head><label>14</label><figDesc>Figure 14: Performance of translations from English into Ukrainian, lemma unigrams.</figDesc><graphic coords="22,109.24,468.41,387.85,157.49" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="18,109.24,152.66,387.85,157.67" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="23,109.24,317.31,387.85,158.92" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Key information on the four corpora used in this study (limited to the corpora in their original languages; we expect numbers of types and tokens to vary in the translations.)</figDesc><table><row><cell></cell><cell>fra</cell><cell>eng</cell><cell>hun</cell><cell>ukr</cell></row><row><cell>number of authors</cell><cell>10</cell><cell>10</cell><cell>9</cell><cell>8</cell></row><row><cell>number of novels</cell><cell>30</cell><cell>30</cell><cell>27</cell><cell>24</cell></row><row><cell>total number of types</cell><cell>68,783</cell><cell>96,603</cell><cell>256,867</cell><cell>99,870</cell></row><row><cell>total number of tokens</cell><cell>2,414,226</cell><cell>4,791,903</cell><cell>2,635,112</cell><cell>824,130</cell></row><row><cell>median number of tokens</cell><cell>75,218.5</cell><cell>148,004.5</cell><cell>76,646.0</cell><cell>20,008.5</cell></row><row><cell>shortest novel, in tokens</cell><cell>29,460</cell><cell>23,459</cell><cell>14,590</cell><cell>9,517</cell></row><row><cell>longest novel, in tokens</cell><cell>149,810</cell><cell>348,793</cell><cell>285,818</cell><cell>102,859</cell></row><row><cell>earliest novel, year</cell><cell>1840</cell><cell>1844</cell><cell>1842</cell><cell>1841</cell></row><row><cell>latest novel, year</cell><cell>1900</cell><cell>1912</cell><cell>1908</cell><cell>1919</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The research presented here has been performed as part of CLS INFRA, a project supported by funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101004984.</p></div>
			</div>


			<div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Data and code are available online at https://gitlab.clsinfra.io/cls-infra/d33 (DOI: https://doi.org/10.5281/zenodo.11080205).</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix</head><p>Note that all result data can be inspected online in our interactive visualization at https://showcases.clsinfra.io/stylometry. A selection of findings is documented here for convenience.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Ukrainian</head><p>The Ukrainian corpus demonstrates lower performance than the original Hungarian corpus, achieving accuracies ranging from 50-80% when using unigram lemmas. Analyses based on word forms yield higher accuracy (Figure <ref type="figure">6</ref>, 60-85%), particularly when the number of features does not exceed 400-500. An interesting observation is that accuracy varies somewhat less across parameters for the Ukrainian original texts than for the other corpora. For character-level analysis, high scores (70-85%) are observed using 3-5-gram characters with the full novel sample size in both word-form and lemma settings (Figure <ref type="figure">7</ref>).</p><p>Translations of the Ukrainian corpus into other languages yield significantly lower results than the original texts, with accuracies surpassing 50% only for full novel sizes in word-level analysis with unigram word forms and lemmas. Among the translations from Ukrainian, the English translation shows the highest results, reaching up to 75% for the full novel sample size (Figure <ref type="figure">8</ref>). This also holds for analyses based on the full novel sample size using character n-grams on translations: in most cases, the translation into English achieves 80% accuracy for 5-gram characters in both word forms and lemmas. In terms of mean performance, however, the translation into Hungarian has the highest mean accuracy in most settings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. French</head><p>The French corpus shows somewhat lower average performance than the Ukrainian and Hungarian corpora, with accuracy ranging from 40-70%, and in some cases up to 80-95% for original texts based on unigram word forms and lemmas with N features up to 550 and a sample size of 8,000 or more (Figure <ref type="figure">9</ref>). Notably, the mean accuracy of original texts based on unigram word forms with a sample size of 10,000 and full novels is very high, reaching 83%, which approximates the performance of the Hungarian original corpus.</p><p>For character n-gram features, we observe high scores (up to 80-95%) exclusively for full novels, with n-gram sizes ranging from 3-5 (Figure <ref type="figure">10</ref>). This holds for both lemmas and plain word forms.</p><p>Translations from French show a moderate decrease in performance compared to the original texts in both lemma- and word-form-based analyses, maintaining high accuracy only for the full novel sample size. For character n-grams, the scores of the translations are almost identical to those of the original novels, with high performance achieved only for the full novel sample size.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. English</head><p>The English corpus (Figure <ref type="figure">11</ref>) yields the lowest performance among all corpora for original texts, with accuracy occasionally reaching 93% only for the full novel sample size in unigram lemma-based analysis at the word level. Apart from these instances, the corpus's performance displays high variability, including the lowest results of around 20% for both lemma-based and word-form-based analyses. Regarding the classification performance when using character n-grams, we observe a similar trend to the French original corpus, where high accuracy is evident only for full novels. However, for the English corpus, the performance is slightly lower, reaching 86% in the best cases (Figure <ref type="figure" target="#fig_10">12</ref>).</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Interpreting Burrows&apos;s Delta: Geometric and Probabilistic Foundations</title>
		<author>
			<persName><forename type="first">S</forename><surname>Argamon</surname></persName>
		</author>
		<idno type="DOI">10.1093/llc/fqn003</idno>
	</analytic>
	<monogr>
		<title level="j">Literary and Linguistic Computing</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="131" to="147" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Cross-Language Authorship Attribution</title>
		<author>
			<persName><forename type="first">D</forename><surname>Bogdanova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lazaridou</surname></persName>
		</author>
		<ptr target="http://www.lrec-conf.org/proceedings/lrec2014/pdf/145_Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC&apos;14)</title>
				<meeting>the Ninth International Conference on Language Resources and Evaluation (LREC&apos;14)<address><addrLine>Reykjavik: Elra</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="2015" to="2020" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">In Search of Comity: TEI for Distant Reading</title>
		<author>
			<persName><forename type="first">L</forename><surname>Burnard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Odebrecht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schöch</surname></persName>
		</author>
		<idno type="DOI">10.4000/jtei.3500</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of the Text Encoding Initiative</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">All the Way Through: Testing for Authorship in Different Frequency Strata</title>
		<author>
			<persName><forename type="first">J</forename><surname>Burrows</surname></persName>
		</author>
		<idno type="DOI">10.1093/llc/fqi067</idno>
	</analytic>
	<monogr>
		<title level="j">Literary and Linguistic Computing</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="27" to="47" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The Englishing of Juvenal: Computational Stylistics and Translated Texts</title>
		<author>
			<persName><forename type="first">J</forename><surname>Burrows</surname></persName>
		</author>
		<idno type="DOI">10.5325/style.36.4.677</idno>
		<ptr target="https://www.jstor.org/stable/10.5325/style.36.4.677" />
	</analytic>
	<monogr>
		<title level="j">Style</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="issue">4</biblScope>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Analysis in authorship attribution</title>
		<author>
			<persName><forename type="first">J</forename><surname>Byszuk</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.7782363</idno>
	</analytic>
	<monogr>
		<title level="m">Survey of Methods in Computational Literary Studies</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Schöch</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Dudar</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Fileva</surname></persName>
		</editor>
		<meeting><address><addrLine>Trier</addrLine></address></meeting>
		<imprint>
			<publisher>CLS INFRA</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A Coefficient of Agreement for Nominal Scales</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cohen</surname></persName>
		</author>
		<idno type="DOI">10.1177/001316446002000104</idno>
	</analytic>
	<monogr>
		<title level="j">Educational and Psychological Measurement</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="37" to="46" />
			<date type="published" when="1960">1960</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Does size matter? Authorship attribution, small samples, big problem</title>
		<author>
			<persName><forename type="first">M</forename><surname>Eder</surname></persName>
		</author>
		<idno type="DOI">10.1093/llc/fqt066</idno>
	</analytic>
	<monogr>
		<title level="j">Digital Scholarship in the Humanities</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="167" to="182" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Short Samples in Authorship Attribution: A New Approach</title>
		<author>
			<persName><forename type="first">M</forename><surname>Eder</surname></persName>
		</author>
		<ptr target="https://dh2017.adho.org/abstracts/341/341.pdf" />
	</analytic>
	<monogr>
		<title level="m">Book of Abstracts of the Digital Humanities Conference</title>
				<meeting><address><addrLine>Montréal</addrLine></address></meeting>
		<imprint>
			<publisher>Adho</publisher>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Stylometry with R: A Package for Computational Text Analysis</title>
		<author>
			<persName><forename type="first">M</forename><surname>Eder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rybicki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kestemont</surname></persName>
		</author>
		<idno type="DOI">10.32614/rj-2016-007</idno>
	</analytic>
	<monogr>
		<title level="j">The R Journal</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">107</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Understanding and explaining Delta measures for authorship attribution</title>
		<author>
			<persName><forename type="first">S</forename><surname>Evert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Proisl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Jannidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Reger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pielström</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schöch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Vitt</surname></persName>
		</author>
		<idno type="DOI">10.1093/llc/fqx023</idno>
	</analytic>
	<monogr>
		<title level="j">Digital Scholarship in the Humanities</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="issue">2</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Universal Dependencies and Author Attribution of Short Texts with Syntax Alone</title>
		<author>
			<persName><forename type="first">R</forename><surname>Gorman</surname></persName>
		</author>
		<ptr target="https://www.digitalhumanities.org/dhq/vol/16/2/000606/000606.html" />
	</analytic>
	<monogr>
		<title level="j">Digital Humanities Quarterly</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">2</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">New Machine Learning Methods Demonstrate the Existence of a Human Stylome</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">V</forename><surname>Halteren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Baayen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Tweedie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Haverkort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neijt</surname></persName>
		</author>
		<idno type="DOI">10.1080/09296170500055350</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Quantitative Linguistics</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="65" to="77" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">The Evolution of Stylometry in Humanities Scholarship</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">I</forename><surname>Holmes</surname></persName>
		</author>
		<idno type="DOI">10.1093/llc/13.3.111</idno>
	</analytic>
	<monogr>
		<title level="j">Literary and Linguistic Computing</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="111" to="117" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Cross-Genre Authorship Verification Using Unmasking</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kestemont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Luyckx</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Crombez</surname></persName>
		</author>
		<idno type="DOI">10.1080/0013838x.2012.668793</idno>
	</analytic>
	<monogr>
		<title level="j">English Studies</title>
		<imprint>
			<biblScope unit="volume">93</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="340" to="356" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Do language combinations affect translators&apos; stylistic visibility in translated texts?</title>
		<author>
			<persName><forename type="first">C</forename><surname>Lee</surname></persName>
		</author>
		<idno type="DOI">10.1093/llc/fqx056</idno>
	</analytic>
	<monogr>
		<title level="j">Digital Scholarship in the Humanities</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="592" to="603" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Cross-linguistic Authorship Attribution and Author Profiling. Machine translation as a method for bridging the language gap</title>
		<author>
			<persName><forename type="first">G</forename><surname>Mikros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Boumparis</surname></persName>
		</author>
		<idno type="DOI">10.1093/llc/fqae028</idno>
	</analytic>
	<monogr>
		<title level="j">Digital Scholarship in the Humanities</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="954" to="967" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">A Theory of Linguistic Individuality for Authorship Analysis</title>
		<author>
			<persName><forename type="first">A</forename><surname>Nini</surname></persName>
		</author>
		<idno type="DOI">10.1017/9781108974851</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
			<publisher>Cambridge University Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Deeper Delta across genres and languages: do we really need the most frequent words?</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rybicki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Eder</surname></persName>
		</author>
		<idno type="DOI">10.1093/llc/fqr031</idno>
	</analytic>
	<monogr>
		<title level="j">Literary and Linguistic Computing</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="315" to="321" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">A Third Glance at a Stylometric Map of Native and Translated Literature in Polish</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rybicki</surname></persName>
		</author>
		<idno type="DOI">10.4324/9780429325366-20</idno>
	</analytic>
	<monogr>
		<title level="m">Retracing the History of Literary Translation in Poland</title>
		<meeting><address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>Routledge</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="247" to="261" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">The great mystery of the (almost) invisible translator: Stylometry in translation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rybicki</surname></persName>
		</author>
		<idno type="DOI">10.1075/scl.51.09ryb</idno>
	</analytic>
	<monogr>
		<title level="m">Studies in Corpus Linguistics</title>
		<editor>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Oakes</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ji</surname></persName>
		</editor>
		<meeting><address><addrLine>Amsterdam</addrLine></address></meeting>
		<imprint>
			<publisher>John Benjamins</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">51</biblScope>
			<biblScope unit="page" from="231" to="248" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">The stylistics and stylometry of collaborative translation: Woolf&apos;s Night and Day in Polish</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rybicki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Heydel</surname></persName>
		</author>
		<idno type="DOI">10.1093/llc/fqt027</idno>
	</analytic>
	<monogr>
		<title level="j">Literary and Linguistic Computing</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="708" to="717" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">Survey of Methods in Computational Literary Studies</title>
		<editor>
			<persName><forename type="first">C</forename><surname>Schöch</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Dudar</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Fileva</surname></persName>
		</editor>
		<idno type="DOI">10.5281/zenodo.7892112</idno>
		<imprint>
			<publisher>CLS INFRA</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives</title>
		<author>
			<persName><forename type="first">C</forename><surname>Schöch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Patraș</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Santos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Erjavec</surname></persName>
		</author>
		<idno type="DOI">10.3828/mlo.v0i0.364</idno>
	</analytic>
	<monogr>
		<title level="j">Modern Languages Open</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Improving Authorship Attribution: Optimizing Burrows&apos; Delta Method</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">W H</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Aldridge</surname></persName>
		</author>
		<idno type="DOI">10.1080/09296174.2011.533591</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Quantitative Linguistics</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="63" to="88" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">A survey of modern authorship attribution methods</title>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<idno type="DOI">10.1002/asi.21001</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Society for Information Science and Technology</title>
		<imprint>
			<biblScope unit="volume">60</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="538" to="556" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">On the robustness of authorship attribution based on character n-gram features</title>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<ptr target="https://brooklynworks.brooklaw.edu/jlp/vol21/iss2/7/" />
	</analytic>
	<monogr>
		<title level="j">Journal of Law and Policy</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">2</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
