DT-grams: Structured Dependency Grammar Stylometry for Cross-Language Authorship Attribution

Benjamin Murauer, Universität Innsbruck, Austria, b.murauer@posteo.de
Günther Specht, Universität Innsbruck, Austria, guenther.specht@uibk.ac.at

32nd GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), September 01-03, 2021, Munich, Germany. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Cross-language authorship attribution problems rely either on translation, which enables the use of single-language features, or on language-independent feature extraction methods. Until recently, the lack of datasets for this problem hindered the development of the latter, and single-language solutions were applied to machine-translated corpora. In this paper, we present a novel language-independent feature for authorship analysis based on dependency graphs and universal part-of-speech tags, called DT-grams (dependency tree grams). DT-grams are constructed by selecting specific sub-parts of the dependency graphs of sentences. We evaluate DT-grams by performing cross-language authorship attribution on untranslated datasets of bilingual authors. Across five different language pairs, they achieve a macro-averaged F1 score that is on average 0.081 higher than that of previous methods. Additionally, by reporting results for a diverse set of comparison features, we provide a baseline for the previously undocumented task of untranslated cross-language authorship attribution.

1. INTRODUCTION
In cross-language authorship attribution, the true author of a previously unseen document must be determined from a set of candidate authors. The model is trained on documents by those candidates written in a different language. Previous work in single-language attribution often relies on language-specific features, and popular, powerful examples exploit character- and word-based measures [16, 2]. Translation enables easy re-use of these features and has been shown to be a useful tool for cross-language attribution [1]. However, setting up a custom machine translation system is expensive in terms of time and resources. From a scientific perspective, translations from commercial, and therefore closed-source, systems are difficult to explain and reproduce. The details of the models are unknown to the customer, and commercial providers will likely keep improving their models, producing different translations of the same input over time. Therefore, language-independent alternatives to traditional attribution features are crucial for cross-language attribution without translation.

Candidates for such features include high-level measurements like vocabulary or punctuation statistics [11]. They also include features that can be mapped to a general space, like universal grammar representations [1]. In this paper, our first contribution is a novel type of classification feature, DT-grams (dependency tree grams). It is based on dependency graphs and universal part-of-speech (POS) tags, which makes it language-independent. It counts the frequencies of substructures within a dependency graph, similar to how traditional n-grams count the frequencies of character or word combinations in the original text. We show that this feature is effective for cross-language authorship attribution, a problem in which documents of bilingual authors are classified, but the language differs between training and testing documents. In our experiments, DT-grams consistently outperform other approaches in this field, by an average of 0.081 F1_macro.

For the authorship attribution experiment, we use a dataset of social media comments by bilingual authors in multiple language pairs. This distinguishes our work from previous research, which used artificially constructed corpora due to the lack of data from multilingual authors [1, 7]. In those corpora, classic novels by professional authors were used as training data, and human-translated versions of other novels by the same authors were used as evaluation data. Research has shown that human translation does not eliminate stylometric features [20]. Even so, the original author still wrote in only one language. Therefore, we argue that this classification problem is, strictly speaking, a measurement of translation obfuscation rather than an authorship attribution problem. By running our evaluation experiments on untranslated data from bilingual authors, we make a second contribution: the first baseline for true, untranslated cross-language authorship attribution.

In summary, our contribution in this paper is twofold: (1) we present DT-grams, a new feature type for cross-language authorship analysis, and (2) our evaluations provide a baseline for the novel problem of true, untranslated cross-language authorship attribution. To ensure the reproducibility of our results, all of our data and code are published online¹.

¹ https://git.uibk.ac.at/csak8736/gvdb2021-code

2. RELATED WORK
Cross-language authorship analysis is a significantly more difficult problem than its single-language version [16]. In many cases, know-how from single-language authorship analysis cannot be used directly. For example, simple features like word or character n-grams are effective for stylometry [5]. However, they are not suitable when the training and testing documents share only a few words, or, when the languages use different alphabets, only a few characters.
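The failure of surface n-grams across languages can be shown with a short Python sketch. It measures the Jaccard overlap between the character trigram sets of two texts. The sentences (an English sentence with hand-made German and Russian translations) and the helper names are illustrative assumptions, not material from the paper. The German version shares very few trigrams with the English one, and the Cyrillic version shares none.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Set of overlapping character n-grams of a lower-cased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Fraction of n-grams shared by two sets (0.0 = disjoint, 1.0 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# The same sentence in three languages (illustrative, not from the paper's corpus).
english = "the cat saw a mouse in the field"
german = "die katze sah eine maus auf dem feld"
russian = "кошка увидела мышь в поле"

en, de, ru = (char_ngrams(s) for s in (english, german, russian))
print(f"EN vs DE: {jaccard(en, de):.2f}")  # very few shared trigrams
print(f"EN vs RU: {jaccard(en, ru):.2f}")  # different alphabet: nothing shared
```

A feature space built from these n-grams therefore has little or no support shared between training and test documents once the language changes.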
Generally, grammar features have proven effective for authorship classification in many tasks, ranging from attribution [9, 21, 4] to plagiarism detection [19]. These examples use language-specific grammar features in single-language settings. Still, they show that such features can distinguish authorship in general, and language-independent grammar features such as universal POS tags allow for cross-language classification [1].

Combining words according to the dependency structure of sentences, rather than their original word order, has led to increased classification performance [15]. However, that study does not use language-independent features. Instead, it changes how word n-grams are constructed by providing an alternative measure of which words neighbor each other. Nevertheless, its findings suggest that the dependency relationships between words within sentences hold valuable information for authorship analysis. Our proposed feature, DT-grams, builds on these findings by combining language-independent universal POS tags with dependency graphs.

Previous attempts at cross-language attribution define the task inconsistently, and the term is interpreted in different ways. These include datasets of monolingual authors writing in different languages [17], and comparisons of feature families in monolingual attribution problems across different languages [2]. Suppose cross-language attribution is defined more strictly: authors have written documents in multiple languages, and training and testing documents must be in different languages. Under this definition, few existing studies remain. Bogdanova and Lazaridou [1] use a variety of features, including the frequencies of universal POS tags. They conclude that machine translation followed by traditional attribution techniques gives the best results. Llorens and Delany [7] use differently sized windows in which vocabulary richness measurements are aggregated. However, both works use datasets of human-translated novels. The original author wrote in only one language, and the other languages were added through translations of these works. It has been shown that translation keeps stylistic features mostly intact [20]. Even so, we argue that the setup of these studies more likely measures how far the translator obfuscated the authorship than the authorship itself. We argue that authors writing in multiple languages are likely to do so in different styles, and we therefore treat this problem as a distinct task.

Therefore, in this paper, we use social media texts written by bilingual authors [10]. This change in text type makes it harder to compare our results directly to previous work. It also allows us to analyze a more comprehensive set of language pairs that are available in this resource but were not included in previous studies due to the lack of data. More importantly, this resource lets our evaluations of DT-grams, along with several established baseline features, provide the first reference results for untranslated authorship attribution in five different language pairs.

[Figure 1: Dependency graph representation of the sentence 'the cat saw a mouse in the field'. The arcs are labeled nsubj, dobj, nmod, case, and det (three times); the universal POS tags of the words are DET NOUN VERB DET NOUN ADP DET NOUN.]

3. DT-GRAMS CONSTRUCTION
To construct the proposed DT-grams feature, we parse textual data to obtain the dependency relationships between the words within sentences, which are then mapped to a tree structure. Next, differently sized substructures are selected from those trees to produce sequences of DT-grams. Some classification models in our experiments use these sequences directly. For the other models, we reduce them to tf/idf-normalized frequencies, forming a bag-of-DT-grams. These steps are explained in detail in the following sections.

3.1 Grammar Representations
In the first step, the raw text is parsed by a dependency parser, for which we use the stanza² Python library. This produces graphs as depicted in Figure 1. Along with the dependency graph, the parser also provides additional information for each word, including its lemma and universal POS tag. The universal POS tag maps the more fine-grained, language-dependent POS tag to a coarse but language-independent tag [12]. We use this tag to represent the word and discard the original word. This way, we construct a language-independent tree from the graph of each sentence that encodes both the relationships between the words and their grammatical roles.

We test three different representations of the nodes within the tree, which are depicted in Figure 2: (1) the name of the incoming dependency (Figure 2a), (2) the universal POS tag of the word (Figure 2b), and (3) both (Figure 2c). This way, we hope to gain insight into which parts of the dependency graph are more important for authorship stylometry. The influence of these choices is discussed in Section 5.

A similar representation of sentences can be obtained with constituency parsers, which we refrained from using for two reasons. Firstly, few parser models are available for non-English languages. Secondly, the resulting constituents are not language-independent, so a global mapping would be needed to perform cross-language classification. Such mappings exist for POS tags [12], but to our knowledge no similar resources are available for constituents.

² https://github.com/stanfordnlp/stanza

[Figure 2: Three node representations of the dependency graph of the subphrase "mouse in the field" from Figure 1: (a) dependency name (dobj; det, nmod; case, det), (b) universal POS tag (NOUN; DET, NOUN; ADP, DET), and (c) concatenation of both (NOUN#dobj; DET#det, NOUN#nmod; ADP#case, DET#det).]

3.2 Tree Substructure Representations
Along the lines of [19], we use patterns of tree structures that represent parts of the dependency tree. We propose several patterns, which we collectively call DT-grams and which are displayed in Figure 3. We first extract node combinations from direct ancestors (DT_anc, Figure 3a) and siblings (DT_sib, Figure 3b), which are the most basic building blocks of a tree. Figure 3c shows DT_pq, which is based on the PQ-grams used in [19]. Finally, we add DT_inv, which orders the sibling/ancestor relationship differently from PQ-grams (Figure 3d).

[Figure 3: DT-grams. Substructures are based on simple tree building blocks, (a) DT_anc (ancestors) and (b) DT_sib (siblings); on (c) DT_pq, the PQ-grams of [19]; and on (d) DT_inv, an inverted form thereof. The panels are annotated with example sizes for the ancestor (blue) and sibling (red) dimensions.]

Table 1: Datasets used for evaluation. A denotes the number of authors. L_doc denotes the average document length in characters. D/A_min denotes the minimum number of documents written by each author in the respective languages in the first column. "DeepL" corresponds to the German documents machine-translated to English with DeepL.
Languages | A | Docs | L_doc | D/A_min
EN + DE | 10 | 2,790 | 3,055 | 22 + 20
EN + DeepL | 10 | 2,790 | 3,055 | 22 + 20
EN + ES | 20 | 3,402 | 3,148 | 20 + 21
EN + PT | 37 | 4,481 | 2,996 | 20 + 20
EN + NL | 11 | 2,056 | 3,225 | 20 + 20
EN + FR | 45 | 7,374 | 3,142 | 21 + 20

Character- and word-based n-grams have only one dimension to scale (namely, n), but these tree substructures can have more. In general, two parameters control how many siblings (red) and ancestors (blue) each pattern takes into account. DT_anc and DT_sib each have only one of these parameters.
For DT_anc and DT_sib, setting the parameter to 1 produces unigrams of the node labels (e.g., POS tag unigrams).

To extract instances of the DT-gram patterns from a tree, the substructure patterns are moved across the tree like a sliding window, generating an instance of the substructure at every step. This requires defining the order in which DT-grams are extracted from the trees (e.g., depth-first or breadth-first). If a substructure does not fit onto a certain position of a tree, the empty spots in the pattern are filled with a wildcard element X. Thus, an instance is generated at every step as long as at least one of the substructure's positions is filled with a non-wildcard node.

This way, the sequence of DT-grams can either be used directly as input for a sequence-based model (e.g., a recurrent network), or the frequencies of the extracted instances can be used analogously to those of character or word n-grams.

For example, applying DT_anc (Figure 3a) with its parameter set to 3 to the tree in Figure 2a results in 11 substructures: X-X-dobj, X-dobj-det, dobj-det-X, det-X-X, X-dobj-nmod, dobj-nmod-case, nmod-case-X, case-X-X, dobj-nmod-det, nmod-det-X, det-X-X.

Finally, the frequency of each produced instance is counted over the entire document, and these frequencies are then tf/idf-normalized over the entire dataset.

4. EVALUATION
To evaluate the DT-grams feature, we perform cross-language authorship attribution using data from multiple language pairs and different classifiers. We compare the results to several baseline features.

4.1 Datasets
To our knowledge, no untranslated cross-language corpora are available. We therefore use the framework of [10] to generate several datasets of bilingual authors in different languages. It collects user comments from the social media site Reddit and allows us to set minimum requirements for document count, length, and language. We use this resource to evaluate DT-grams on different language pairs and generate bilingual datasets for the combinations presented in Table 1. We choose five language pairs that all include English, which accounts for the largest portion of text in Reddit comments. The other languages were chosen because they are the largest non-English text sources in this corpus. We set the parameters of the generation framework to produce corpora with at least 10 authors for each pair, where each author has at least 20 documents in both languages. To increase the quality of the text documents, we also required a minimum document length of 3,000 characters. The tools that generate these corpora perform preprocessing, including replacing URLs with a tag and filtering out messages that mainly consist of punctuation. For a full list of preprocessing steps, we refer to the original publication [10]. We performed no additional preprocessing. The resulting corpora are shown in Table 1, and we provide them publicly for download³.

³ https://git.uibk.ac.at/csak8736/gvdb2021-code

In previous work, monolingual attribution techniques applied to machine-translated documents outperform cross-language techniques [1]. We therefore provide data for such a baseline: we use the commercial translation service DeepL⁴ to translate the German documents to English, creating a monolingual version of the German documents for comparison. However, for budgetary reasons, we perform this step for only one randomly picked language (German).

⁴ https://www.deepl.com/, translation performed in November 2019

For each language pair (A, B), we conduct all experiments both with training on A and testing on B, and the other way around.

4.2 Evaluation Strategy
The parameterized datasets only define lower limits for the number of documents per author and for document size. The resulting datasets therefore contain varying numbers of documents and authors. To keep results from different datasets comparable, we select only 10 random authors from each dataset and 10 random documents in each language from those authors.

To reduce bias, each of these evaluations is repeated 10 times, with the selected authors and documents randomized in each repetition. In each repetition, all combinations of features and classifiers are tested, and the mean value of each combination across all repetitions represents that combination. This also serves as a substitute for traditional cross-validation, which is impossible for cross-domain classification. Using documents from the training set interchangeably for testing would break the cross-domain nature of the setup. We are aware that, as a result, some datasets have a larger overlap between repetitions than others. This flaw might be mitigated in the future if more comprehensive corpora of bilingual authors become available. It is also negligible if results from differently sized datasets do not need to be compared directly.

4.3 Models and Baselines
We test several text classification models, following previous approaches in authorship attribution tasks. These are summarized in Figure 4.

[Figure 4: Models used in the experiments. Documents are represented by LIFE features, word n-grams, character n-grams, universal POS n-grams, or DT-grams. Frequency representations are classified with a linear SVM or XGBoost; sequence representations are classified with Doc2Vec + LR or a CNN.]

Firstly, tf/idf-normalized frequencies of n-grams have been widely used in the authorship analysis field, including character, word, and part-of-speech tag n-grams. The same approach applies to DT-grams: we count the frequencies of the extracted DT-grams and normalize them using tf/idf. We then test two commonly used classifiers: linear SVMs [16, 11, 6] and extreme gradient boosting [7]. As comparison baselines in this category, we include results from character, word, and universal POS tag n-grams, with n ranging from 1 to 5.

Secondly, we use the Doc2Vec document embedding technique in combination with a logistic regression classifier, as proposed in [3]. For this solution, we have to define what a document is in terms of DT-grams, as their order is no longer well-defined. We interpret each document as the sequence of DT-grams returned by the parser, which in our case uses a depth-first approach. Following [3], we include baselines consisting of character, word, and universal POS n-grams with n ranging from 1 to 5.

Thirdly, we use the convolutional neural network proposed in [14], treating each DT-gram as a unique token in the network's embedding layer. We keep the parameters and network layout of [14], except for a larger embedding layer to fit the longer documents. As in the second approach, a depth-first order defines the sequence of tokens. The baseline for this model uses character, word, and universal POS tag unigram representations of the documents.

As a further comparison baseline, we compute the vocabulary richness feature LIFE from [7]. It counts vocabulary frequencies over differently sized windows and calculates various aggregated measures. We do not use the other features presented in related cross-language research [1]. They depend on language-specific resources like sentiment databases, which are difficult to collect and even harder to compare across languages. Additionally, in that research, these features performed worse than character-based features extracted from machine-translated text. To classify the documents with LIFE features, we use the same linear SVM and extreme gradient boosting classifiers as for the tf/idf frequency features (see Figure 4).

5. RESULTS AND DISCUSSION
We run the classification experiment for each model, each language pair in both directions, and every parameter combination shown in Table 2, generating an exhaustive grid of results. In this section, we aggregate and select from this result set to extract the key findings of this paper.

Table 2: Hyperparameters optimized by grid search. All n-gram sizes were tested individually for word, character, and universal POS tag n-grams.
Parameter | Values
n-gram size | 1–3
DT-gram structure | DT_anc, DT_sib, DT_pq, DT_inv
DT-gram dim. sizes | 1–4, 1–4
C-value of SVM | 0.1, 1, 10
Doc2Vec emb. size | 50, 100, ..., 250
CNN batch size | 5, 10, 20

5.1 Performance per Model
Table 3 shows that the linear support vector machine with tf/idf frequency features outperforms all other models in every language combination and for most of the feature categories. For the vocabulary richness feature LIFE, we confirm the finding of the original work [8] that a tree-based approach outperforms the support vector machine. The original work used a random forest; here, extreme gradient boosting is used.
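The evaluation protocol of Section 4.2 and the F1_macro metric used for all results can be sketched in Python as below. The corpus layout (author → language → list of documents) and the function names are hypothetical. So is the `train_and_predict` callable, which stands in for any of the models in Figure 4. The sketch only mirrors the described procedure: sample 10 authors and 10 documents per language, train in both directions, and average over 10 randomized repetitions. `macro_f1` is the unweighted mean of per-author F1 scores over all labels that occur in the true or predicted labels.

```python
import random
from statistics import mean


def macro_f1(y_true: list, y_pred: list) -> float:
    """Unweighted mean of per-author F1 scores (macro averaging)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn))  # never 0/0: label occurs
    return mean(scores)


def run_protocol(corpus, lang_a, lang_b, train_and_predict,
                 n_authors=10, n_docs=10, repetitions=10, seed=0):
    """Mean F1_macro per training direction over randomized repetitions."""
    rng = random.Random(seed)
    directions = [(lang_a, lang_b), (lang_b, lang_a)]
    scores = {d: [] for d in directions}
    for _ in range(repetitions):
        # Fresh random authors and documents in every repetition.
        authors = rng.sample(sorted(corpus), n_authors)
        docs = {a: {lang: rng.sample(corpus[a][lang], n_docs)
                    for lang in (lang_a, lang_b)} for a in authors}
        for train_lang, test_lang in directions:
            x_train = [d for a in authors for d in docs[a][train_lang]]
            y_train = [a for a in authors for _ in docs[a][train_lang]]
            x_test = [d for a in authors for d in docs[a][test_lang]]
            y_test = [a for a in authors for _ in docs[a][test_lang]]
            y_pred = train_and_predict(x_train, y_train, x_test)
            scores[(train_lang, test_lang)].append(macro_f1(y_test, y_pred))
    return {d: mean(s) for d, s in scores.items()}


# Toy corpus: 12 authors with 15 placeholder documents per language.
corpus = {f"author{i}": {lang: [f"author{i}|{lang}|{j}" for j in range(15)]
                         for lang in ("en", "de")} for i in range(12)}


def oracle(x_train, y_train, x_test):
    """Stand-in classifier that cheats by reading the author from the text."""
    return [doc.split("|")[0] for doc in x_test]


print(run_protocol(corpus, "en", "de", oracle))  # {('en', 'de'): 1.0, ('de', 'en'): 1.0}
```

The oracle only exercises the plumbing. A real model, such as tf/idf-weighted DT-gram counts with a linear SVM, would be passed in its place.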
0.4 F1macro “DeepL” denotes the German documents machine-translated to English with DeepL. 0.3 0.2 Word Char. Uni. POS Model LIFE n-grams n-grams n-grams DT-grams 0.1 svm 0.110 0.396 0.479 0.385 0.453 0 xgb 0.157 0.189 0.332 0.282 0.328 E S R L T L cnn - 0.092 0.075 0.133 0.102 /D /E /F /N /P ep EN EN EN EN EN /De d2v - 0.143 0.341 0.344 0.336 EN (b) Max. F1macro score of the models across different features. Figure 5: Comparison of the highest F1macro scores for dif- ferent feature types. The different datasets are plotted on Table 3: F1macro of the models across different datasets (a) the x-axis, where “DeepL” stands for the documents that and features (b). have been machine-translated from German to English. For layout reasons, experiments that differ only in classification DTg EN ⁄DE EN⁄ES EN⁄FR EN⁄NL EN⁄PT EN⁄DeepL direction (e.g., en Ñ de and de Ñ en) are averaged, whereas the difference in F1macro between the directions was below DTanc 0.33 0.21 0.23 0.23 0.18 0.42 DTsib 0.29 0.24 0.24 0.25 0.25 0.42 0.02 for each pair. The DT-gram feature outperforms the DTpq 0.35 0.26 0.28 0.28 0.29 0.43 next best feature by 0.081 F1macro averaged over all untrans- DTinv 0.37 0.30 0.29 0.23 0.27 0.43 lated language pairs. Table 4: Max. F1macro score of each DT-gram type. experiments including datasets from less related language families such as Japanese or Arabic may provide further insights into this relationship. have significantly less training documents than in the original The proposed DT-gram feature is the most effective feature paper, in which case network models have been shown to for the untranslated scenarios, outperforming the next best have trouble capturing the style of authors [5]. feature across the language pairs by an average of 0.081 While the document embedding model (d2v in the table) F1macro . 
outperforms the frequency-based features with the extreme This suggests that the grammatical characteristics of mul- boosting trees in some cases, it does not reach the support tilingual authors are kept across languages. The perfor- vector machine’s F1 scores in any language or feature set. mance of these features consistently outperforms n-grams constructed from the universal POS tag-based on the original 5.2 Performance per Feature Category word order, we conclude that the dependency relationships Figure 5 displays the highest F1macro score for each fre- between the words and therefore, a grammatical style con- quency feature category and dataset. It becomes clear that tribute to an author’s stylometric fingerprint. the vocabulary richness feature LIFE is not able to model the When comparing the different languages, we can see a authors effectively. An explanation for this is found in the clear difference in classification performance. For the two basic principle behind the feature itself, which counts aggre- grammatical feature types, namely universal POS tag n- gated vocabulary richness measures across sliding windows grams and DT-grams, the results of the German dataset over the document. Being originally developed for classifying show better F1 scores compared to the other languages. One entire novels from professional authors allowed these window possible explanation for this result the overall higher grammar sizes to be large and carry more information than is the case complexity of German compared to the other languages [13], with shorter texts. 
Likewise and unsurprisingly, the word which would, in turn, suggest that either (1) classification n-grams are not able to model authorship except for the across languages with grammars of different complexity, or (2) machine-translated dataset, which is the only case where classification across languages with general high complexity a significant intersection between training and validation improve the usefulness of grammar features themselves. vocabulary can be expected. However, to answer these questions, additional language Confirming the results of [1], we observe that traditional combinations must be analyzed, which may prove difficult features are effective in classifying machine-translated text, for low-resource languages given the already small amount outperforming all other features. We can also confirm their of available data from bilingual authors for languages that finding that machine-translation increases the performance of are not considered low-resource. language-independent features. Interestingly, the character In summary, no approach is able to beat traditional meth- n-gram features perform well above the 10% random baseline ods performed on machine-translated texts, but our proposed also for the non-translated datasets. This suggests a measure DT-gram feature outperforms all other tested features on of similarity between these languages, but we leave the inter- untranslated cross-language scenarios, especially on German pretation of these results to the field of linguistics. Future documents. It represents a promising start for future de- En/German En/Spanish En/French En/Dutch En/Portuguese En/Translation 0.5 0.4 F1macro 0.3 DTsib 0.2 DTpq DTinv 0.1 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 (a) Influence of the horizontal (red) parameter value (x-axis) on the F1macro score (y-axis). 
En/German En/Spanish En/French En/Dutch En/Portuguese En/Translation 0.5 0.4 F1macro 0.3 DTanc 0.2 DTpq DTinv 0.1 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 (b) Influence of the vertical (blue) parameter value (x-axis) on the F1macro score (y-axis). Figure 6: Influence of the horizontal (a) and vertical (b) DT-gram parameter sizes. Note that DTsib is only included in (a) as it lacks a vertical parameter, and likewise, DTanc is only included in (b). Node ⁄DE EN EN ⁄ES EN ⁄FR EN ⁄NL ⁄PT EN ⁄DeepL EN Only DTanc benefits from a higher vertical parameter size, Dep. 0.366 0.239 0.274 0.257 0.218 0.445 especially in German documents, which may benefit from U.POS 0.375 0.291 0.310 0.277 0.246 0.450 even higher values of the respective parameter. While Span- both 0.368 0.232 0.294 0.262 0.235 0.453 ish shows the least difference in classification performance across the different parameter sizes, it is difficult to draw Table 5: Max. F1macro scores of different internal node conclusions from the other languages, indicating that more layouts for the dependency tree. data is required for further experiments. velopment and research of true cross-language authorship 6. CONCLUSION attribution. In this paper, we have presented a novel type of classifica- tion feature called DT-grams, based on dependency graphs 5.3 Performance by Tree Node Structure and universal POS tags. We have shown in experiments that As described in Section 3.1, we tried different representa- DT-grams able to efficiently model stylometric fingerprints tions of the internal nodes of the dependency tree structure. of bilingual authors across languages, premiering authorship In Table 5, the best results for each of these can be found. 
Interestingly, the dependency type used in the graph does not seem to have a large impact on classification performance. Instead, the biggest advantage comes from using only the structure of the graph along with the universal POS tag of each word.

5.4 Tree Substructure Performance Analysis
Having demonstrated the general effectiveness of the dependency-tree-based features, we now turn to Table 4, which shows how the different DT-grams perform on each language combination. In general, the substructures that combine ancestor and sibling nodes (DTpq and DTinv) outperform the simpler patterns for each language. This suggests that complex structures in grammatical style are a valuable stylometric feature for bilingual authors across languages.

Figure 6 shows a more detailed analysis of how the sizes of the two parameters influence this result. For both the vertical and the horizontal parameter, the optimal value lies between 2 and 3, depending on the language and substructure, which is similar to reported optimal values for character n-grams [16].

analysis even in cases where machine translation is unavailable, with an average lead of 0.081 F1macro over the next-best approach tested in our experiments. Additionally, we have expanded the field of cross-language authorship attribution by providing baseline results for the previously undocumented problem of untranslated cross-language authorship attribution of bilingual authors, and we have analyzed results for five different language pairs. Finally, our findings include the unexpectedly good performance of language-dependent features in cross-language settings, as well as significant differences across language pairs.

The most important limitations of our approach are its dependence on the external parsing tools, whose quality may differ across languages, and the superior performance of approaches based on machine translation.

In future work, we want to investigate more specialized syntax classification models such as tree-LSTMs [18] or more complex syntactic networks [4], as well as the combination of multiple feature categories, to further improve classification results in both cross- and single-language experiment settings.

7. REFERENCES
[1] D. Bogdanova and A. Lazaridou. Cross-language authorship attribution. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC'2014), pages 2015–2020, 2014.
[2] M. Eder. Style-markers in authorship attribution: a cross-language study of the authorial fingerprint. Studies in Polish Linguistics, 6(1):99–114, 2011.
[3] H. Gómez-Adorno, J.-P. Posadas-Durán, G. Sidorov, and D. Pinto. Document embeddings learned on various types of n-grams for cross-topic authorship attribution. Computing, 100(7):741–756, 2018.
[4] F. Jafariakinabad and K. A. Hua. Style-aware neural model with application in authorship attribution. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pages 325–328. IEEE, 2019.
[5] M. Kestemont, M. Tschuggnall, E. Stamatatos, W. Daelemans, G. Specht, B. Stein, and M. Potthast. Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection. In L. Cappellato, N. Ferro, J.-Y. Nie, and L. Soulier, editors, Working Notes Papers of the CLEF 2018 Evaluation Labs, CEUR Workshop Proceedings. CLEF and CEUR-WS.org, 2018.
[6] M. Koppel, J. Schler, S. Argamon, and E. Messeri. Authorship attribution with thousands of candidate authors. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 659–660. ACM, 2006.
[7] M. Llorens and S. J. Delany. Deep level lexical features for cross-lingual authorship attribution. In Proceedings of the first Workshop on Modeling, Learning and Mining for Cross/Multilinguality, pages 16–25. Dublin Institute of Technology, 2016.
[8] M. Llorens-Salvador. Lexical rIchness Feature Extraction method (LIFE) for Multilingual and Cross-lingual Authorship Attribution. Dissertation, Dublin Institute of Technology, 2018.
[9] K. Luyckx and W. Daelemans. Shallow Text Analysis and Machine Learning for Authorship Attribution. In Proceedings of the 15th meeting of Computational Linguistics in the Netherlands, pages 149–160. LOT, 2005.
[10] B. Murauer and G. Specht. Generating cross-domain text classification corpora from social media comments. In Working Notes of the Conference and Labs of the Evaluation forum (CLEF'2019), pages 114–125. Springer, 2019.
[11] A. Narayanan, H. Paskov, N. Z. Gong, J. Bethencourt, E. Stefanov, E. C. R. Shin, and D. Song. On the feasibility of internet-scale author identification. In 2012 IEEE Symposium on Security and Privacy, pages 300–314. IEEE, 2012.
[12] J. Nivre, M.-C. De Marneffe, F. Ginter, Y. Goldberg, J. Hajic, C. D. Manning, R. McDonald, S. Petrov, S. Pyysalo, N. Silveira, et al. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666, 2016.
[13] M. Sadeniemi, K. Kettunen, T. Lindh-Knuutila, and T. Honkela. Complexity of European Union languages: A comparative approach. Journal of Quantitative Linguistics, 15(2):185–211, 2008.
[14] P. Shrestha, S. Sierra, F. Gonzalez, M. Montes, P. Rosso, and T. Solorio. Convolutional neural networks for authorship attribution of short texts. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, 2017.
[15] G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh, and L. Chanona-Hernández. Syntactic Dependency-Based N-grams as Classification Features, volume 11 of Mexican International Conference on Artificial Intelligence (MICAI'2012), pages 1–11. Springer Heidelberg Berlin, 2013.
[16] E. Stamatatos. On the Robustness of Authorship Attribution Based on Character N-Gram Features. Journal of Law & Policy, pages 421–439, 2013.
[17] L. M. Stuart, S. Tazhibayeva, A. R. Wagoner, and J. M. Taylor. Style features for authors in two languages. In 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), pages 459–464. IEEE, 2013.
[18] K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks, 2015.
[19] M. Tschuggnall and G. Specht. Countering Plagiarism by Exposing Irregularities in Authors' Grammar. In Proceedings of the European Intelligence and Security Informatics Conference (EISIC'2013), pages 15–22. IEEE, 2013.
[20] L. Venuti. The translator's invisibility: A history of translation. Routledge, 1995.
[21] R. Zhang, Z. Hu, H. Guo, and Y. Mao. Syntax encoding with application in authorship attribution. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018.
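To make the vertical and horizontal parameters discussed in Section 5.4 more concrete, the following is a minimal Python sketch of pq-gram extraction over a dependency tree labeled with universal POS tags. It assumes that a substructure such as DTpq combines a vertical stem of p labels (p-1 ancestors plus the anchor node) with a window of q consecutive dependents, in the spirit of the classic pq-gram tree profile. This mapping is an illustrative assumption and not necessarily the exact DT-gram construction used in the experiments. The input format (CoNLL-U-style (UPOS, head) pairs), the function names `pq_grams` and `children_map`, and the padding symbol `*` are hypothetical. Parsing is left to an external tool.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical input format: one parsed sentence as (UPOS tag, head index)
# pairs, CoNLL-U style. Tokens are 1-indexed and head 0 marks the root.
Sentence = List[Tuple[str, int]]

PAD = "*"  # dummy label used to pad stems and child windows


def children_map(sentence: Sentence) -> Dict[int, List[int]]:
    """Map each token index (0 = virtual root) to its dependents in word order."""
    kids: Dict[int, List[int]] = {i: [] for i in range(len(sentence) + 1)}
    for idx, (_, head) in enumerate(sentence, start=1):
        kids[head].append(idx)  # ascending idx keeps dependents in word order
    return kids


def pq_grams(sentence: Sentence, p: int, q: int) -> Counter:
    """Bag of pq-grams over UPOS labels: p-label vertical stem + q horizontal dependents."""
    kids = children_map(sentence)
    profile: Counter = Counter()

    def visit(node: int, ancestors: List[str]) -> None:
        stem = ancestors + [sentence[node - 1][0]]  # p labels, anchor last
        deps = [sentence[c - 1][0] for c in kids[node]]
        if not deps:
            # a leaf contributes one gram whose horizontal part is all padding
            profile[tuple(stem) + (PAD,) * q] += 1
        else:
            # pad the dependent sequence and slide a window of width q over it
            base = [PAD] * (q - 1) + deps + [PAD] * (q - 1)
            for i in range(len(deps) + q - 1):
                profile[tuple(stem) + tuple(base[i:i + q])] += 1
        for child in kids[node]:
            visit(child, stem[1:])  # the child's ancestor stem keeps p-1 labels

    for root in kids[0]:
        visit(root, [PAD] * (p - 1))
    return profile


if __name__ == "__main__":
    # "She reads books ." with UPOS tags and heads as a parser might output them
    sentence = [("PRON", 2), ("VERB", 0), ("NOUN", 2), ("PUNCT", 2)]
    for gram, count in sorted(pq_grams(sentence, p=2, q=2).items()):
        print(gram, count)
```

Counting these tuples per document and normalizing them to relative frequencies would give a feature vector analogous to character n-gram counts. Under the assumed mapping, setting p and q to 2 or 3 corresponds to the optimal range reported for the vertical and horizontal parameters.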