DT-grams: Structured Dependency Grammar Stylometry
           for Cross-Language Authorship Attribution

                          Benjamin Murauer                                                Günther Specht
                      Universität Innsbruck, Austria                                 Universität Innsbruck, Austria
                       b.murauer@posteo.de                                         guenther.specht@uibk.ac.at


ABSTRACT
Cross-language authorship attribution problems rely on either translation to enable the use of single-language features, or language-independent feature extraction methods. Until recently, the lack of datasets for this problem hindered the development of the latter, and single-language solutions were performed on machine-translated corpora. In this paper, we present a novel language-independent feature for authorship analysis based on dependency graphs and universal part-of-speech tags, called DT-grams (dependency tree grams), which are constructed by selecting specific sub-parts of the dependency graph of sentences. We evaluate DT-grams by performing cross-language authorship attribution on untranslated datasets of bilingual authors, showing that, on average, they achieve a macro-averaged F1 score 0.081 higher than previous methods across five different language pairs. Additionally, by providing results for a diverse set of features for comparison, we provide a baseline on the previously undocumented task of untranslated cross-language authorship attribution.

1.    INTRODUCTION
   In cross-language authorship attribution, the true author of a previously unseen document must be determined from a set of candidate authors after training a model with documents from those candidates in a different language. Previous work in single-language attribution often relies on language-specific features. Here, popular and powerful features often exploit character- and word-based measures [16, 2]. Using translation enables easy re-use of these features, and has been shown to be a useful tool for cross-language attribution [1]. However, setting up a custom machine translation system is an expensive operation in terms of time and resources. From a scientific perspective, translations from commercial, and therefore closed-source, systems are difficult to explain and reproduce, as the details of the models are unknown to the customer and commercial providers will likely try to improve their models, causing different translations of the same input over time. Therefore, language-independent alternatives to traditional attribution features are crucial for cross-language attribution without translation.
   Candidates for such features include high-level measurements like vocabulary or punctuation statistics [11] or features that can be mapped to a general space like universal grammar representations [1]. In this paper, our first contribution is a novel type of classification feature, DT-grams (dependency tree grams), that is based on dependency graphs and universal part-of-speech (POS) tags, making it language-independent. It calculates frequencies of substructures within a dependency graph, similar to how frequencies of character or word combinations in the original text are counted for traditional n-grams. We show that this feature is effective for cross-language authorship attribution, a problem in which documents of bilingual authors are classified, but the language differs between training and testing documents. In our experiments, DT-grams outperform other approaches in this field consistently by an average F1macro score of 0.081.
   For the authorship attribution experiment, we use a dataset consisting of social media comments of bilingual authors in multiple language pairs. This distinguishes this work from previous research, which used artificially constructed corpora due to the lack of data from multilingual authors [1, 7]. There, classic novels by professional authors were used as training data, and human-translated versions of other novels by the same author were used as evaluation data. Although research has shown that human translation does not eliminate stylometric features [20], the original author still has only written in one language. Therefore, we argue that the classification problem is, more strictly speaking, a translation obfuscation measurement rather than an authorship attribution problem. By performing our evaluation experiments on the untranslated data from bilingual authors, we add a second contribution to this paper by providing the first baseline for true, untranslated cross-language authorship attribution.
   In summary, our contribution in this paper is twofold: (1) we present a new feature type, DT-grams, for cross-language authorship analysis, and (2) our evaluations represent a baseline for the novel problem of true, untranslated cross-language authorship attribution. To ensure the reproducibility of our results, all of our data and code are published online¹.

¹ https://git.uibk.ac.at/csak8736/gvdb2021-code

32nd GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), September 01-03, 2021, Munich, Germany.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2.    RELATED WORK
   Cross-language authorship analysis is a significantly more difficult problem than its single-language version [16], and
in many cases, know-how learned from single-language authorship analysis cannot be directly used. For example, simple syntactic features like word or character n-grams are an effective feature for stylometry [5], but are not suitable when the training and testing documents only share a few words, or even characters when given a different alphabet. Generally, using grammar features for authorship classification has been proven effective in many tasks ranging from attribution [9, 21, 4] to plagiarism detection [19]. Although these examples use language-specific grammar features in single-language settings, they show the general ability of these features to distinguish authorship, and language-independent grammar features such as universal POS tags allow for cross-language classification [1].
   Using different combinations of words by leveraging the dependency of sentences rather than the original word order has led to increased classification performance [15]. However, this study does not make use of language-independent features but rather changes how word n-grams are constructed by providing an alternative measure of which words neighbor each other. Nevertheless, their findings suggest that the dependency relationships between words within sentences hold valuable information for authorship analysis.
   Our proposed feature, DT-grams, leverages key findings of previous observations by combining language-independent universal POS tags with dependency graphs.
   Previous attempts at cross-language attribution define the task itself inconsistently and different approaches to this term are taken, including datasets of monolingual authors of different languages [17] or comparing the performance of feature families in mono-lingual attribution problems for different languages [2]. When refining the definition of cross-language attribution as the task of attributing authors that have written documents in multiple languages, where training and testing documents must be written in different languages, few existing studies remain: [1] use a variety of different features including the frequency of universal POS tags on attribution, but conclude that machine translation followed by traditional attribution techniques provides the best results. [7] use differently sized windows in which vocabulary richness measurements are aggregated. However, in both works, the datasets that were used contain human-translated novels, where the original author only wrote in one language and the source of the other languages was added by using translations of these works. Although it has been shown that translation keeps stylistic features mostly intact [20], we claim that the setup by these studies more likely measures the extent to which the authorship was obfuscated by the translator rather than the authorship itself. We state that authors writing in multiple languages are likely to do so in different styles, and we distinguish this problem as a different type of task.
   Therefore, in this paper, we use social media texts that have been written by bilingual authors [10]. While this change in text type makes it more difficult to compare the results directly to previous work, it also allows us to analyze a more comprehensive set of language pairs that are available within this resource, and have not been included in previous studies due to the lack of data. More importantly though, by using this resource, our evaluations of the DT-grams feature along with several previously established baseline features provide first reference results for untranslated authorship attribution in five different language pairs.

     the    cat     saw     a      mouse   in     the    field
     DET    NOUN    VERB    DET    NOUN    ADP    DET    NOUN

Figure 1: Dependency graph representation of the sentence ‘the cat saw a mouse in the field’ (arc labels shown in the graph: nsubj, dobj, nmod, case, det).

3.    DT-GRAMS CONSTRUCTION
   To construct the proposed DT-grams feature, we parse textual data to obtain dependency relationships between the words within sentences, which are then mapped to a tree structure. Then, differently sized substructures are selected from those trees to produce sequences of DT-grams. Finally, while some classification models used in our experiments use these sequences directly, we also reduce them to tf/idf-normalized frequencies to form a bag-of-DT-grams for other models used in the evaluation. In the following sections, these steps are explained in detail.

3.1   Grammar Representations
   In the first step, the raw text is parsed by a dependency parser. For this, we use the stanza² Python library. This produces graphs as depicted in Figure 1. Along with the dependency graph, the parser also provides additional information for each word, including its lemma and universal POS tag. The latter is a mapping from the more fine-grained language-dependent POS tag to a coarse, but language-independent universal tag [12], and we use it to represent the word, discarding the original word itself. This way, we construct a language-independent tree from the graph of each sentence, and encode both the relationship between the words as well as their grammatical role.

² https://github.com/stanfordnlp/stanza

   We test three different representations of the nodes within the tree, which are depicted in Figure 2: (1) the name of the incoming dependency (Figure 2a), (2) the universal POS tag of the word (Figure 2b), and (3) both (Figure 2c). This way, we hope to gain insight into which parts of the dependency graph are more important for authorship stylometry. The resulting influence of these choices is discussed in Section 5.
   A similar representation of sentences can be achieved by using constituency parsers, which we refrained from using for two reasons: firstly, the availability of parser models for non-English languages is limited, and secondly, the resulting constituents are not language-independent and a global mapping must be used in order to perform cross-language classification. While such mappings exist for POS tags [12], no similar resources for constituents are available to our knowledge.

3.2   Tree Substructure Representations
   Along the lines of [19], we use patterns of tree structures representing parts of the dependency tree. We propose several patterns, which we collectively call DT-grams and which are displayed in Figure 3. The intention behind choosing these specific structures is as follows: We first extract node combinations from direct ancestors (DTanc, Figure 3a) and
siblings (DTsib, Figure 3b), representing the most basic building blocks of a tree. In Figure 3c, DTpq is displayed, based on the PQ-grams used by [19]. Finally, we add DTinv, which uses a different order of sibling/ancestor relationship (Figure 3d) compared to PQ-grams.

     (a) Dependency name      (b) Universal POS tag      (c) Concatenation
     dobj                     NOUN                       NOUN#dobj
       det                      DET                        DET#det
       nmod                     NOUN                       NOUN#nmod
         case                     ADP                        ADP#case
         det                      DET                        DET#det

Figure 2: Three node representations of the dependency graph of the subphrase “mouse in the field” from Figure 1 containing the name of the dependency (a), the universal POS tag (b), and both (c).

     (a) DTanc (ancestors): blue=3          (b) DTsib (siblings): red=2
     (c) DTpq (PQ-grams): blue=2, red=3     (d) DTinv (inverted PQ): blue=2, red=3

Figure 3: DT-grams. Substructures are based on simple tree building blocks (a, b), PQ-grams by [19] (c) and an inverted form thereof (d).

     Languages      A      Docs     Ldoc     D/Amin
     EN + DE        10     2,790    3,055    22 + 20
     EN + DeepL     10     2,790    3,055    22 + 20
     EN + ES        20     3,402    3,148    20 + 21
     EN + PT        37     4,481    2,996    20 + 20
     EN + NL        11     2,056    3,225    20 + 20
     EN + FR        45     7,374    3,142    21 + 20

Table 1: Datasets used for evaluation. A denotes the number of authors. Ldoc denotes the average document length in characters. D/Amin denotes the minimum number of documents written by each author in the respective languages in the first column. “DeepL” corresponds to the German documents machine-translated to English with DeepL.

   While character and word-based n-grams only have one dimension to scale (namely, n), these tree substructures can have more. In general, two parameters control the number of siblings (red) and ancestors (blue) taken into account for each pattern, whereas DTanc and DTsib both only have one of those parameters each. For DTanc and DTsib, setting the parameter to 1 results in calculating POS tag unigrams.
   To get instances of the DT-gram patterns from a tree, the substructure patterns are moved across the tree similar to a sliding window, generating an instance of the substructure
at every step. Thereby, one has to define an order in which
the DT-grams are parsed from the trees (i.e., depth-first or
breadth-first). If a substructure does not fit onto a certain      available to our knowledge, we use the framework by [10] to
position of a tree, the empty spots in the pattern are filled      generate several datasets by bilingual authors in different lan-
with a wildcard element X. Thereby, an instance is generated       guages. It collects user comments from the social media site
for every step as long as at least one of the substructure’s       Reddit and allows us to set minimum requirements for docu-
positions is filled with a non-wildcard node.                      ment count, length, and language. We use this resource to
   This way, the sequence of DT-grams can either be used           evaluate the performance of DT-grams for different language
directly as input for a sequence-based model (e.g., a recurrent    pairs and generate bilingual datasets for the combinations
network), or the frequencies of the parsed instances can be        presented in Table 1. We choose five different language pairs
used analogously to those of character or word n-grams.            which all contain English, which represents the largest por-
   For example, applying DTanc shown in Figure 3a with             tion of text in Reddit comments. The other languages were
its parameter set to 3 to the tree in Figure 2a results in         chosen as they represent the largest non-English text sources
11 substructures: X-X-dobj, X-dobj-det, dobj-det-X, det-X-         for this corpus. We set the parameters of the generation
X, X-dobj-nmod, dobj-nmod-case, nmod-case-X, case-X-X,             framework to produce corpora with at least 10 authors for
dobj-nmod-det, nmod-det-X, det-X-X.                                each pair, where each author has at least 20 documents for
   Finally, the frequency of each produced instance is counted     both languages. To increase the quality of the text docu-
over the entire document, and these frequencies are then           ments, we also required a minimum document length of 3,000
tf/idf-normalized over the entire dataset.                         characters. The tools that generate these corpora perform
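As an illustration only, the worked DTanc example can be reproduced with the short Python sketch below. It is reconstructed from the 11 instances listed in that example and is not taken from the authors' published code. The names Node, dt_anc, by_dependency, by_pos and by_both are hypothetical, and the traversal rule (depth-first, emit the window ending at each node, then keep sliding past every leaf while at least one non-wildcard node remains) is an assumption inferred from the example. The sibling-based patterns (DTsib, DTpq, DTinv) are not covered.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List

WILDCARD = "X"  # fills empty pattern positions, as in the text


@dataclass
class Node:
    """One word of a dependency tree (hypothetical helper type)."""
    upos: str    # universal POS tag, e.g. "NOUN"
    deprel: str  # name of the incoming dependency, e.g. "dobj"
    children: List["Node"] = field(default_factory=list)


# The three node representations of Section 3.1 (Figure 2a-c).
def by_dependency(node: Node) -> str:
    return node.deprel


def by_pos(node: Node) -> str:
    return node.upos


def by_both(node: Node) -> str:
    return f"{node.upos}#{node.deprel}"


def dt_anc(root: Node, n: int,
           label: Callable[[Node], str] = by_dependency) -> List[str]:
    """Ancestor n-grams in depth-first order, padded with wildcards."""
    grams: List[str] = []

    def visit(node: Node, path: List[str]) -> None:
        path = path + [label(node)]
        # Window ending at this node: the node plus up to n-1 ancestors.
        grams.append("-".join(path[-n:]))
        if not node.children:
            # Slide past the leaf until only one real node is left.
            for k in range(1, n):
                grams.append("-".join(path[-(n - k):] + [WILDCARD] * k))
        for child in node.children:
            visit(child, path)

    # Assumed padding: n-1 wildcards above the root.
    visit(root, [WILDCARD] * (n - 1))
    return grams


# Tree of Figure 2: dobj -> (det, nmod), nmod -> (case, det).
tree = Node("NOUN", "dobj", [
    Node("DET", "det"),
    Node("NOUN", "nmod", [Node("ADP", "case"), Node("DET", "det")]),
])

grams = dt_anc(tree, 3)
print(len(grams))  # 11
print(grams[:4])   # ['X-X-dobj', 'X-dobj-det', 'dobj-det-X', 'det-X-X']

# Raw per-document frequencies, before any tf/idf normalization.
counts = Counter(grams)
print(counts["det-X-X"])  # 2
```

Passing by_pos or by_both as the label function switches to the other two node representations of Figure 2. With n set to 1 the function returns one label per node, which matches the remark that a parameter of 1 yields unigrams.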
                                                                   preprocessing including replacing URLs with a tag <URL> or
                                                                   filtering messages that mainly consist of punctuation. For a
4.     EVALUATION                                                  full list of preprocessing steps, we refer to the original pub-
  To evaluate the DT-grams feature, we perform cross-              lication by [10]. We performed no additional preprocessing.
language authorship attribution using data from multiple           The resulting corpora are shown in Table 1 and we provide
language pairs and different classifiers, and we compare the       them publicly for download3 .
results to different baseline features.                               In previous work, mono-lingual attribution techniques on
                                                                   machine-translated documents outperform cross-language
4.1      Datasets
                                                                   3
     Since there are no untranslated cross-language corpora            https://git.uibk.ac.at/csak8736/gvdb2021-code
[Diagram: Documents are turned into the features LIFE, DT-grams, word n-grams, univ. POS n-grams and character n-grams, which are used as frequencies or as sequences and passed to the classifiers linear SVM, XGBoost, Doc2Vec + LR and CNN.]

Figure 4: Models used in the experiments.

     Parameter             Values
     n-gram size           1-3
     DT-gram structure     DTanc, DTsib, DTpq, DTinv
     DT-gram dim. sizes    1-4, 1-4
     C-value of SVM        0.1, 1, 10
     Doc2Vec emb. size     50, 100, ..., 250
     CNN batch size        5, 10, 20

Table 2: Hyperparameters optimized by grid search. All n-gram sizes were tested individually for word, character and universal POS-tag n-grams.

techniques [1]. We therefore provide data to calculate such a
baseline by using the commercial translation service DeepL4       n-grams, whereby n ranges from 1 to 5.
to translate the German documents to English, creating a            Secondly, we utilize the Doc2Vec document embedding
mono-lingual version of the German documents for compari-         technique in combination with a logistic regression classifier,
son. However, due to budgetary reasons, we only perform           as proposed by [3]. For this solution, we have to define what a
this step for one randomly picked language (German).              document is in terms of DT-grams, as their order is no longer
  For each language pair (A, B), we conduct all experiments       well-defined. We interpret each document as the sequence
both with training on A and testing on B, as well as the          of DT-grams that is returned by the parser, which in our
other way around.                                                 case uses a depth-first approach. We include baselines for
                                                                  comparison along the lines of [3], which consist of character,
4.2         Evaluation Strategy                                   word, and universal POS n-grams ranging from n=1 to 5.
   Since the parameterized datasets only define lower limits        Thirdly, we use a convolutional neural network proposed
for the number of documents per author and the size of these      in [14] by interpreting each DT-gram as a unique token used
documents, the resulting datasets have varying amounts            in the embedding layer of the network. Thereby, we use the
of documents and authors. We ensure that results from             same parameters and network layout as in [14], except for an
experiments using these datasets can be easily compared           increased embedding layer size to fit the larger documents.
to each other by only selecting 10 random authors of each         We utilize the same depth-first order as in the second ap-
dataset, and selecting 10 random documents of each language       proach to define a sequence of tokens. The baseline for this
from those authors.                                               model uses character, word, and universal POS tag unigram
   To reduce bias, each of these evaluations is repeated 10       representations of the documents.
times, and the selected authors and documents are random-           As a further comparison baseline, we compute the vo-
ized in each repetition. For each of these repetitions, all       cabulary richness feature LIFE from [7], which counts the
combinations of features and classifiers are tested, and the      vocabulary frequency over differently sized windows and cal-
mean value of each combination across all repetitions is used     culates various aggregated measures. We refrain from using
as a representative for that combination. This also functions     other language-agnostic features presented in related cross-
as a supplement for traditional cross-validation, which is        language research [1], which depend on language-specific
impossible for cross-domain classification as documents in        resources like sentiment databases, which are difficult to
the training set can’t be used interchangeably for testing,       collect and even harder to compare. Additionally, in their
which would break the cross-domain nature of the setup. We        research, these approaches showed inferior performance com-
are aware that this results in some datasets having a larger      pared to character-based features from machine-translated
overlap between the repetitions than others, which is a flaw      text. We use the same linear SVM and extreme gradient
that might be mitigated in the future if more comprehen-          boosting classifiers as the tf/idf frequency feature category
sive corpora of bilingual authors become available, or direct     to classify the documents with LIFE features (see Figure 4).
comparison between results originating from differently sized
datasets is not important.                                        5.    RESULTS AND DISCUSSION
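The sampling and repetition scheme of the evaluation strategy (10 random authors, 10 random documents per language and author, 10 repetitions, both classification directions) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the corpus layout (author -> language -> list of documents), all function names, and the fit_predict callable that stands in for any feature and classifier combination are hypothetical. macro_f1 is the unweighted mean of the per-author F1 scores, taken over the authors that occur in the test labels.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

Corpus = Dict[str, Dict[str, List[str]]]  # author -> language -> documents
Sample = List[Tuple[str, str]]            # (document, author) pairs
FitPredict = Callable[[Sample, List[str]], List[str]]  # hypothetical hook


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-author F1 scores."""
    scores = []
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 without true positives.
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return mean(scores)


def draw_sample(corpus: Corpus, lang_a: str, lang_b: str, rng: random.Random,
                n_authors: int = 10, n_docs: int = 10) -> Tuple[Sample, Sample]:
    """Pick random authors, then random documents per language for each."""
    authors = rng.sample(sorted(corpus), n_authors)
    side_a = [(doc, a) for a in authors
              for doc in rng.sample(corpus[a][lang_a], n_docs)]
    side_b = [(doc, a) for a in authors
              for doc in rng.sample(corpus[a][lang_b], n_docs)]
    return side_a, side_b


def evaluate(corpus: Corpus, lang_a: str, lang_b: str, fit_predict: FitPredict,
             repetitions: int = 10, n_authors: int = 10, n_docs: int = 10,
             seed: int = 0) -> Dict[str, float]:
    """Mean macro F1 per direction over repeated random samples."""
    rng = random.Random(seed)
    a_to_b, b_to_a = f"{lang_a}->{lang_b}", f"{lang_b}->{lang_a}"
    scores: Dict[str, List[float]] = {a_to_b: [], b_to_a: []}
    for _ in range(repetitions):
        side_a, side_b = draw_sample(corpus, lang_a, lang_b, rng,
                                     n_authors, n_docs)
        # Train on one language and test on the other, in both directions.
        for name, train, test in ((a_to_b, side_a, side_b),
                                  (b_to_a, side_b, side_a)):
            predicted = fit_predict(train, [doc for doc, _ in test])
            truth = [author for _, author in test]
            scores[name].append(macro_f1(truth, predicted))
    return {name: mean(values) for name, values in scores.items()}
```

Training and test documents are never mixed across languages here, which mirrors the reason given above for not using ordinary cross-validation. Fixing the seed makes a run repeatable, while each repetition still draws new authors and documents.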
4.3         Models and Baselines
  We test several different text classification models by following previous approaches in authorship attribution tasks. These are summarized in Figure 4.
  Firstly, calculating tf/idf-normalized frequencies of different types of n-grams has been used widely in the authorship analysis field, including character, word, or part-of-speech tag n-grams. This approach can be used analogously by counting the frequencies of the parsed DT-grams and normalizing them using tf/idf. We then test two commonly used classifiers: linear SVMs [16, 11, 6] and extreme gradient boosting [7]. As comparison baselines of this category, we include results from character, word, and universal POS tag

4 https://www.deepl.com/, translation performed in November 2019

  We run the classification experiment for each model, each language pair in both directions, and every parameter combination shown in Table 2, generating an exhaustive grid of results. In this section, different aggregations and selections of this entire result set are used to extract the key findings for this paper.

5.1    Performance per Model
  Table 3 shows that the linear support vector machine with tf/idf frequency features outperforms all other models in every language combination and for most of the feature categories. In the case of the vocabulary richness feature LIFE, we can confirm the results of the original work that the random forest-based approach outperforms the support vector machine [8].
  We suspect that the CNN model underperforms because we
     Model   EN/DE   EN/ES   EN/FR   EN/NL   EN/PT   EN/DeepL
     svm     0.375   0.291   0.310   0.277   0.246   0.479
     xgb     0.268   0.207   0.229   0.209   0.175   0.332
     cnn     0.112   0.108   0.104   0.102   0.119   0.133
     d2v     0.261   0.180   0.179   0.193   0.213   0.344

(a) Max. F1macro score of the models across all datasets. “DeepL” denotes the German documents machine-translated to English with DeepL.

     Model   LIFE    Word n-grams   Char. n-grams   Uni. POS n-grams   DT-grams
     svm     0.110   0.396          0.479           0.385              0.453
     xgb     0.157   0.189          0.332           0.282              0.328
     cnn     -       0.092          0.075           0.133              0.102
     d2v     -       0.143          0.341           0.344              0.336

(b) Max. F1macro score of the models across different features.

Table 3: F1macro of the models across different datasets (a) and features (b).

     DTg     EN/DE   EN/ES   EN/FR   EN/NL   EN/PT   EN/DeepL
     DTanc   0.33    0.21    0.23    0.23    0.18    0.42
     DTsib   0.29    0.24    0.24    0.25    0.25    0.42
     DTpq    0.35    0.26    0.28    0.28    0.29    0.43
     DTinv   0.37    0.30    0.29    0.23    0.27    0.43

Table 4: Max. F1macro score of each DT-gram type.

[Bar chart: y-axis F1macro (0 to 0.5); x-axis EN/DE, EN/ES, EN/FR, EN/NL, EN/PT, EN/DeepL; series LIFE, Word n-grams, Char. n-grams, Uni. POS tag n-grams, DT-grams.]

Figure 5: Comparison of the highest F1macro scores for different feature types. The different datasets are plotted on the x-axis, where “DeepL” stands for the documents that have been machine-translated from German to English. For layout reasons, experiments that differ only in classification direction (e.g., en → de and de → en) are averaged, whereas the difference in F1macro between the directions was below 0.02 for each pair. The DT-gram feature outperforms the next best feature by 0.081 F1macro averaged over all untranslated language pairs.

experiments including datasets from less related language families such as Japanese or Arabic may provide further insights into this relationship.
have significantly fewer training documents than in the original     The proposed DT-gram feature is the most effective feature
paper, in which case network models have been shown to            for the untranslated scenarios, outperforming the next best
have trouble capturing the style of authors [5].                  feature across the language pairs by an average of 0.081
  While the document embedding model (d2v in the table)           F1macro .
outperforms the frequency-based features with the extreme            This suggests that the grammatical characteristics of mul-
boosting trees in some cases, it does not reach the support       tilingual authors are kept across languages. The perfor-
vector machine’s F1 scores in any language or feature set.        mance of these features consistently outperforms n-grams
                                                                  constructed from the universal POS tag-based on the original
5.2     Performance per Feature Category                          word order, we conclude that the dependency relationships
   Figure 5 displays the highest F1macro score for each fre-      between the words and therefore, a grammatical style con-
quency feature category and dataset. It becomes clear that        tribute to an author’s stylometric fingerprint.
the vocabulary richness feature LIFE is not able to model the        When comparing the different languages, we can see a
authors effectively. An explanation for this is found in the      clear difference in classification performance. For the two
basic principle behind the feature itself, which counts aggre-    grammatical feature types, namely universal POS tag n-
gated vocabulary richness measures across sliding windows         grams and DT-grams, the results of the German dataset
over the document. Being originally developed for classifying     show better F1 scores compared to the other languages. One
entire novels from professional authors allowed these window      possible explanation for this result the overall higher grammar
sizes to be large and carry more information than is the case     complexity of German compared to the other languages [13],
with shorter texts. Likewise and unsurprisingly, the word         which would, in turn, suggest that either (1) classification
n-grams are not able to model authorship except for the           across languages with grammars of different complexity, or (2)
machine-translated dataset, which is the only case where          classification across languages with general high complexity
a significant intersection between training and validation        improve the usefulness of grammar features themselves.
vocabulary can be expected.                                          However, to answer these questions, additional language
   Confirming the results of [1], we observe that traditional     combinations must be analyzed, which may prove difficult
features are effective in classifying machine-translated text,    for low-resource languages given the already small amount
outperforming all other features. We can also confirm their       of available data from bilingual authors for languages that
finding that machine-translation increases the performance of     are not considered low-resource.
language-independent features. Interestingly, the character          In summary, no approach is able to beat traditional meth-
n-gram features perform well above the 10% random baseline        ods performed on machine-translated texts, but our proposed
also for the non-translated datasets. This suggests a measure     DT-gram feature outperforms all other tested features on
of similarity between these languages, but we leave the inter-    untranslated cross-language scenarios, especially on German
pretation of these results to the field of linguistics. Future    documents. It represents a promising start for future de-
                                   En/German           En/Spanish             En/French            En/Dutch       En/Portuguese En/Translation
                        0.5
                        0.4
          F1macro

                        0.3                                                                                                               DTsib
                        0.2                                                                                                               DTpq
                                                                                                                                          DTinv
                        0.1
                               1     2   3     4   1     2    3      4    1     2   3     4    1    2   3     4   1   2   3   4   1   2   3   4
                              (a) Influence of the horizontal (red) parameter value (x-axis) on the F1macro score (y-axis).

                                  En/German            En/Spanish             En/French            En/Dutch       En/Portuguese En/Translation
                        0.5
                        0.4
         F1macro


                        0.3                                                                                                               DTanc
                        0.2                                                                                                               DTpq
                                                                                                                                          DTinv
                        0.1
                              1      2   3     4   1     2    3      4    1     2   3     4    1    2   3     4   1   2   3   4   1   2   3   4
                              (b) Influence of the vertical (blue) parameter value (x-axis) on the F1macro score (y-axis).

Figure 6: Influence of the horizontal (a) and vertical (b) DT-gram parameter sizes. Note that DTsib is only included in (a) as
it lacks a vertical parameter, and likewise, DTanc is only included in (b).


  Node              ⁄DE
                   EN         EN   ⁄ES   EN  ⁄FR   EN  ⁄NL    ⁄PT
                                                             EN           ⁄DeepL
                                                                         EN               Only DTanc benefits from a higher vertical parameter size,
  Dep. 0.366                   0.239     0.274     0.257     0.218        0.445           especially in German documents, which may benefit from
 U.POS 0.375                   0.291     0.310     0.277     0.246        0.450           even higher values of the respective parameter. While Span-
  both 0.368                   0.232     0.294     0.262     0.235        0.453           ish shows the least difference in classification performance
                                                                                          across the different parameter sizes, it is difficult to draw
Table 5: Max. F1macro scores of different internal node                                   conclusions from the other languages, indicating that more
layouts for the dependency tree.                                                          data is required for further experiments.


velopment and research of true cross-language authorship                                  6.       CONCLUSION
attribution.                                                                                 In this paper, we have presented a novel type of classifica-
                                                                                          tion feature called DT-grams, based on dependency graphs
5.3    Performance by Tree Node Structure                                                 and universal POS tags. We have shown in experiments that
   As described in Section 3.1, we tried different representa-                            DT-grams able to efficiently model stylometric fingerprints
tions of the internal nodes of the dependency tree structure.                             of bilingual authors across languages, premiering authorship
In Table 5, the best results for each of these can be found.                              analysis even in cases where machine-translation is unavail-
Interestingly, the type of dependency which is used in the                                able, with an average lead of 0.081 F1macro to the next best
graph does not seem to have a large impact on the classifica-                             approach tested in our experiments. Additionally, we have ex-
tion performance, but rather using only the structure of the                              panded the field of cross-language authorship attribution by
graph along with the universal POS tag of each word shows                                 providing baseline results for the previously undocumented
the biggest advantage.                                                                    problem of untranslated cross-language authorship attribu-
                                                                                          tion of bilingual authors and analyzed results of 5 different
5.4    Tree Substructure Performance Analysis                                             language pairs. Finally, we have collected findings including
  As we demonstrated the general efficiency of the depen-                                 unexpectedly good performances of language-dependent fea-
dency tree-based features, Table 4 shows how the different                                tures applied to cross-language settings as well as significant
DT-grams perform on each language combination. In general,                                differences across language pairs.
the substructures that combine ancestor and sibling nodes                                    The most important limitations of our approach are the
(DTpq and DTinv ) outperform the more simple patterns for                                 dependency on the performance of the external parsing tools
each language and suggest that complex structures in gram-                                used, which may differ in quality across languages, as well as
matical style are a valuable stylometric feature for bilingual                            the superior performance of approaches based on machine-
authors across languages.                                                                 translation.
  Figure 6 shows a more detailed analysis of how the sizes of                                In future work, we want to investigate on using more
the two parameters influence this result. For both the verti-                             specialized syntax classification models like tree-LSTMs [18]
cal and horizontal parameters, the optimal value is between 2                             or more complex syntactic networks [4], as well as combining
and 3, depending on the language and substructure, which is                               multiple feature categories to further improve classification
similar to reported optimal values for character n-grams [16].                            results in both cross- and single-language experiment settings.
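
As a reading aid for the F1macro scores reported throughout Section 5, the following minimal Python sketch shows how a macro-averaged F1 score can be computed, and how the scores of the two classification directions of a language pair can be averaged as done for Figure 5. The function names macro_f1 and direction_average and the toy author labels are illustrative assumptions. They are not taken from the experiment code, and the printed values describe only the toy data.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, so every author counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # the predicted author gains a false positive
            fn[truth] += 1   # the true author gains a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def direction_average(score_ab, score_ba):
    """Mean of the scores for the two directions of one language pair."""
    return (score_ab + score_ba) / 2


# Toy data (hypothetical): three authors "a", "b", "c", two documents each.
truth = ["a", "a", "b", "b", "c", "c"]
pred_en_to_de = ["a", "b", "b", "b", "c", "a"]
pred_de_to_en = ["a", "a", "b", "c", "c", "c"]

f1_en_to_de = macro_f1(truth, pred_en_to_de)
f1_de_to_en = macro_f1(truth, pred_de_to_en)
print(round(direction_average(f1_en_to_de, f1_de_to_en), 3))
```

Because every author contributes equally to the mean, a classifier that neglects authors with few documents is penalized more strongly than it would be under plain accuracy.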
7.   REFERENCES
 [1] D. Bogdanova and A. Lazaridou. Cross-language
     authorship attribution. In Proceedings of the 9th
     International Conference on Language Resources and
     Evaluation (LREC’2014), pages 2015–2020, 2014.
 [2] M. Eder. Style-markers in authorship attribution: a
     cross-language study of the authorial fingerprint.
     Studies in Polish Linguistics, 6(1):99–114, 2011.
 [3] H. Gómez-Adorno, J.-P. Posadas-Durán, G. Sidorov,
     and D. Pinto. Document embeddings learned on
     various types of n-grams for cross-topic authorship
     attribution. Computing, 100(7):741–756, 2018.
 [4] F. Jafariakinabad and K. A. Hua. Style-aware neural
     model with application in authorship attribution. In
     2019 18th IEEE International Conference On Machine
     Learning And Applications (ICMLA), pages 325–328.
     IEEE, 2019.
 [5] M. Kestemont, M. Tschuggnall, E. Stamatatos,
     W. Daelemans, G. Specht, B. Stein, and M. Potthast.
     Overview of the Author Identification Task at
     PAN-2018: Cross-domain Authorship Attribution and
     Style Change Detection. In L. Cappellato, N. Ferro,
     J.-Y. Nie, and L. Soulier, editors, Working Notes
     Papers of the CLEF 2018 Evaluation Labs, CEUR
     Workshop Proceedings. CLEF and CEUR-WS.org,
     2018.
 [6] M. Koppel, J. Schler, S. Argamon, and E. Messeri.
     Authorship attribution with thousands of candidate
     authors. In Proceedings of the 29th annual international
     ACM SIGIR conference on Research and development
     in information retrieval, pages 659–660. ACM, 2006.
 [7] M. Llorens and S. J. Delany. Deep level lexical features
     for cross-lingual authorship attribution. In Proceedings
     of the first Workshop on Modeling, Learning and
     Mining for Cross/Multilinguality, pages 16–25. Dublin
     Institute of Technology, 2016.
 [8] M. Llorens-Salvador. Lexical rIchness Feature
     Extraction method (LIFE) for Multilingual and
     Cross-lingual Authorship Attribution. Dissertation,
     Dublin Institute of Technology, 2018.
 [9] K. Luyckx and W. Daelemans. Shallow Text Analysis
     and Machine Learning for Authorship Attribution. In
     Proceedings of the 15th meeting of Computational
     Linguistics in the Netherlands, pages 149–160. LOT,
     2005.
[10] B. Murauer and G. Specht. Generating cross-domain
     text classification corpora from social media comments.
     In Working Notes of the Conference and Labs of the
     Evaluation forum (CLEF’2019), pages 114–125.
     Springer, 2019.
[11] A. Narayanan, H. Paskov, N. Z. Gong, J. Bethencourt,
     E. Stefanov, E. C. R. Shin, and D. Song. On the
     feasibility of internet-scale author identification. In
     2012 IEEE Symposium on Security and Privacy, pages
     300–314. IEEE, 2012.
[12] J. Nivre, M.-C. De Marneffe, F. Ginter, Y. Goldberg,
     J. Hajic, C. D. Manning, R. McDonald, S. Petrov,
     S. Pyysalo, N. Silveira, et al. Universal dependencies v1:
     A multilingual treebank collection. In Proceedings of the
     Tenth International Conference on Language Resources
     and Evaluation (LREC’16), pages 1659–1666, 2016.
[13] M. Sadeniemi, K. Kettunen, T. Lindh-Knuutila, and
     T. Honkela. Complexity of european union languages:
     A comparative approach. Journal of Quantitative
     Linguistics, 15(2):185–211, 2008.
[14] P. Shrestha, S. Sierra, F. Gonzalez, M. Montes,
     P. Rosso, and T. Solorio. Convolutional neural
     networks for authorship attribution of short texts. In
     Proceedings of the 15th Conference of the European
     Chapter of the Association for Computational
     Linguistics: Volume 2, Short Papers. Association for
     Computational Linguistics, 2017.
[15] G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh,
     and L. Chanona-Hernández. Syntactic
     Dependency-Based N-grams as Classification Features,
     volume 11 of Mexican International Conference on
     Artificial Intelligence (MICAI’2012), pages 1–11.
     Springer Heidelberg Berlin, 2013.
[16] E. Stamatatos. On the Robustness of Authorship
     Attribution Based on Character N-Gram Features.
     Journal of Law & Policy, pages 421–439, 2013.
[17] L. M. Stuart, S. Tazhibayeva, A. R. Wagoner, and J. M.
     Taylor. Style features for authors in two languages. In
     2013 IEEE/WIC/ACM International Joint Conferences
     on Web Intelligence (WI) and Intelligent Agent
     Technologies (IAT), pages 459–464. IEEE, 2013.
[18] K. S. Tai, R. Socher, and C. D. Manning. Improved
     semantic representations from tree-structured long
     short-term memory networks, 2015.
[19] M. Tschuggnall and G. Specht. Countering Plagiarism
     by Exposing Irregularities in Authors’ Grammar. In
     Proceedings of the European Intelligence and Security
     Informatics Conference, (EISIC’2013), pages 15–22.
     IEEE, 2013.
[20] L. Venuti. The translator’s invisibility: A history of
     translation. Routledge, 1995.
[21] R. Zhang, Z. Hu, H. Guo, and Y. Mao. Syntax
     encoding with application in authorship attribution. In
     Proceedings of the 2018 Conference on Empirical
     Methods in Natural Language Processing. Association
     for Computational Linguistics, 2018.