<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Deep Level Lexical Features for Cross-lingual Authorship Attribution</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Marisa Llorens-Salvador, Sarah Jane Delany</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dublin Institute of Technology</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <fpage>16</fpage>
      <lpage>25</lpage>
      <abstract>
<p>Cross-lingual document classification aims to classify documents written in different languages that share a common genre, topic or author. Knowledge-based methods and others based on machine translation deliver state-of-the-art classification accuracy; however, because of their reliance on external resources, poorly resourced languages present a challenge for these types of methods. In this paper, we propose a novel set of language-independent features that capture language use from a document at a deep level, using features that are intrinsic to the document. These features are based on vocabulary richness measurements and are text-length independent and self-contained, meaning that no external resources such as lexicons or machine translation software are needed. Preliminary evaluation shows promising results for the task of cross-lingual authorship attribution, outperforming similar methods.</p>
      </abstract>
      <kwd-group>
<kwd>Cross-lingual document classification</kwd>
        <kwd>cross-lingual authorship attribution</kwd>
        <kwd>deep level lexical features</kwd>
        <kwd>vocabulary richness features</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>Despite the prevalence of the English language in many fields, international
organizations manage large numbers of documents in different languages, from
local legislation to internal documents produced in different company locations.
At the same time, workers' mobility has created a multilingual workforce that
creates and stores documents in different languages depending on the context. For
example, the same author can write academic papers in English, a technical
book in French and a novel in Catalan. The classification of these multilingual
documents has applications in the areas of information retrieval, forensic
linguistics and humanities scholarship.</p>
      <p>
        The analysis of document style and language use has long been used as a
tool for author attribution. Traditionally, research in the area focused on
monolingual corpora [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] or employed external resources such as machine translation,
multilingual lexicons or parallel corpora [
        <xref ref-type="bibr" rid="ref14 ref15 ref3">3, 14, 15</xref>
        ].
      </p>
<p>In this paper, we present a set of language-independent lexical features and
study their performance when used to solve the problem of cross-lingual authorship
attribution. The task of cross-lingual authorship attribution (CLAA) refers to the
identification of the author of a document written in language x_i from a pool of
known authors whose known documents are written in languages x_1, x_2, ..., x_n.
The aim of the method is to identify the author of an unseen document without
prior knowledge about its language, i.e. without using any language-specific
features, tuning for a particular language, or machine translation/lexicon
aid, in a completely language-independent implementation.</p>
<p>The proposed method builds on traditional vocabulary richness (VR) measures,
such as the type-token ratio or hapax frequency. Traditional vocabulary
richness features are text-length dependent and provide a small number of
features (the type-token ratio being the best example, with only one value representing
each text). In order to overcome these limitations, our proposed method for
feature extraction calculates features on fixed-length samples of text extracted from
the document. Mean and dispersion values for vocabulary richness are calculated,
obtaining 8 deep level lexical features. The performance of different sample sizes i
is studied individually and in combinations of sizes, providing information about
text consistency through the document and characteristic vocabulary use.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
Monolingual authorship attribution has in the last few years achieved a high level
of accuracy using lexical features such as frequencies of the most common words
and Burrows' Delta to calculate distances between documents [
        <xref ref-type="bibr" rid="ref1 ref11 ref13 ref4 ref7">1, 4, 7, 11, 13</xref>
        ].
Other lexical features used in monolingual author attribution include
frequencies of stop words [
        <xref ref-type="bibr" rid="ref2">2</xref>
] and word n-grams. In these models, a feature vector with all
features (n-grams or stop words) contained in the document and their
frequencies characterizes each document. The problem when extending these methods
to multilingual corpora is that the dimensions of the feature vectors in different
languages are in general orthogonal, giving zero as the similarity measure
between documents. Character n-grams have been applied to different languages
and have obtained high levels of accuracy at the expense of high dimensionality,
with feature set sizes in the thousands [
        <xref ref-type="bibr" rid="ref7">7</xref>
]. At a syntactic level, features such as
part-of-speech tags and the frequencies of verbs and pronouns have achieved high
levels of accuracy as well [
        <xref ref-type="bibr" rid="ref6">6</xref>
]. However, all the above features are either language
dependent or involve high-dimensional feature sets.
      </p>
      <p>
Traditional vocabulary richness measures such as the type-token ratio are
language-independent; however, they depend on text length and for this reason have been
replaced in recent times by more complex features, including the
Moving Window Type-Token Ratio and the Moving Window Type-Token
Ratio Distribution [
        <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
]. Despite their language-independent nature, traditional
measurements of vocabulary richness have not delivered highly accurate results
in the past [
        <xref ref-type="bibr" rid="ref13">13</xref>
]. Consequently, they have been replaced by lexical
features used in combination with machine translation software or lexicons/dictionaries
to bring all documents into the same language space, with Wikipedia and the
EuroVoc corpus being the most commonly used resources [
        <xref ref-type="bibr" rid="ref10 ref14 ref9">9, 10, 14</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
<p>Based on vocabulary richness and frequency spectrum values, the proposed
features and method for feature extraction define a way of quantifying the style of
a text by analysing the use of vocabulary in samples of different sizes taken from
the text. These samples are based on the idea of a moving window type-token
ratio using fixed-size samples, hence avoiding the shortcomings of the
type-token ratio. These features extend the moving window type-token ratio, as more
granular measurements of word frequencies are extracted.</p>
<p>Three sampling methods are included in the framework: (i) Fragment
sampling (FS), (ii) Bags-of-words sampling (BS) and (iii) the combination of both,
Bag-Fragment sampling (BFS).</p>
<p>Fragment sampling (FS) is defined as the process of randomly obtaining n
samples of i consecutive words, each starting from a word chosen at random;
each sample is referred to as a fragment. Given the random nature of the sampling
process, these fragments can overlap and do not follow any sequence in terms
of overall location in the text. Bags-of-words sampling (BS) involves the use of
i words sampled randomly from any part of the document and follows the well-known
concept of treating a text as a bag-of-words, where the location of words
is ignored.</p>
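The two sampling schemes described above can be sketched as follows; this is a minimal illustration (the function names and the toy document are our own, not from the paper):

```python
import random

def fragment_sample(words, i, n):
    """FS: n samples of i consecutive words, each starting at a random
    position; fragments may overlap and follow no order in the text."""
    starts = [random.randrange(len(words) - i + 1) for _ in range(n)]
    return [words[s:s + i] for s in starts]

def bag_sample(words, i, n):
    """BS: n bags of i words drawn at random from anywhere in the
    document, ignoring word positions (bag-of-words view)."""
    return [random.sample(words, i) for _ in range(n)]

# Toy stand-in for a document's word list.
doc = ("some toy text standing in for a novel " * 200).split()
fragments = fragment_sample(doc, 100, 200)  # 200 fragments of 100 words each
bags = bag_sample(doc, 100, 200)            # 200 bags of 100 words each
```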
<p>The proposed set of language-independent lexical features is extracted
following a four-step process:
STEP 1: A number n of document samples of size i is extracted.
STEP 2: Values for frequency features are calculated per sample.
STEP 3: Values for mean and dispersion features are calculated across the n
samples.</p>
<p>STEP 4: Return to Step 1 for a new sample size i.</p>
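Steps 1-4 amount to the loop sketched below. For brevity, the per-sample measurement here is just the type-token ratio of a fragment; the actual method computes a full set of frequency features per sample. The sizes, helper names and toy input are illustrative assumptions:

```python
import random
from statistics import mean, pstdev

def extract_features(words, sizes, n):
    """For each sample size i: draw n fragment samples (STEP 1), compute a
    per-sample value (STEP 2), summarize it by mean and dispersion over the
    n samples (STEP 3), then move on to the next size (STEP 4)."""
    features = []
    for i in sizes:                                 # STEP 4: next sample size
        values = []
        for _ in range(n):                          # STEP 1: n samples of size i
            start = random.randrange(len(words) - i + 1)
            sample = words[start:start + i]
            values.append(len(set(sample)) / i)     # STEP 2: type-token ratio
        features += [mean(values), pstdev(values)]  # STEP 3: mean + dispersion
    return features

words = ("the quick brown fox " * 500).split()      # toy 2000-word document
feats = extract_features(words, sizes=[100, 500], n=200)
```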
<p>The general parameters of the method are: the type of sample (Fragment,
Bags-of-words or both), the sample sizes i1, i2, ..., iM, and the number of samples n
per sample size. Figure 1 depicts a diagram of the extraction process for BFS; FS
and BS are represented by the left- and right-hand sides of the diagram respectively.</p>
<p>The proposed set of frequency features is based on the analysis of the
frequency spectrum, i.e. how many times each word appears. A typical example
of this type of feature is the number of hapaxes, or words that appear only once
in the text. Instead of using the entire frequency spectrum, and in order to
reduce the number of features and capture information in a compact way, a novel
method of frequency spectrum representation is presented.</p>
      <p>[Figure 1: Feature extraction process for BFS. For each of the M sample sizes i1 = 100, ..., iM = 2000, n fragments (left) and n bags-of-words (right) are sampled; 4 mean and 4 dispersion features are computed per sample type and size, giving N = M x 16 features.]</p>
<p>The frequency spectrum for different texts shows regular behaviour for the
initial low frequencies; however, after frequency 10 the number of words for each
frequency becomes less stable, as can be seen in Figure 2, which shows the
frequency spectrum for Charles Dickens' Oliver Twist in its original language. For
this reason, frequency values over 10 are not used for the purpose of feature
extraction. Notwithstanding these considerations, the words included in that
frequency range (over 10) are not entirely neglected, as they feature as part of
the overall vocabulary and hence contribute to the classification process.</p>
<p>The frequency spectrum for frequency values between 1 and 10 is
regular (quasi-linear) and hence suitable for a small number of points to represent
its behaviour. In order to reduce the dimensions of the feature set, and given
the quasi-linear behaviour of the data, a further simplification is performed and
groupings of 1, 2-4 and 5-10 are used. Each frequency range is represented by
a feature, obtaining 3 features to represent the frequency spectrum between 1
and 10, plus a separate fourth feature that represents the vocabulary, i.e. the
number of distinct words present in the text. Figure 2a shows the 3-feature
representation of the data for Charles Dickens' Oliver Twist in its original language,
English, plotted on top of the overall frequency spectrum.</p>
<p>The feature representation of the frequency spectrum for frequency values
between 1 and 10 holds for fragment and bags-of-words samples, as shown in
Figure 2b. The sampling process allows dispersion features to be calculated,
providing a measurement of the homogeneity of the text.</p>
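A sketch of the 4 per-sample frequency features (hapaxes, words occurring 2-4 times, words occurring 5-10 times, and vocabulary size) and of the 8 per-size features obtained as mean and dispersion across samples. The paper does not name a specific dispersion statistic, so population standard deviation is assumed here:

```python
from collections import Counter
from statistics import mean, pstdev

def frequency_features(sample):
    """4 features for one sample: nr of words appearing once, 2-4 times,
    5-10 times, and the total nr of distinct words (vocabulary)."""
    spectrum = Counter(Counter(sample).values())  # frequency -> nr of words
    once = spectrum.get(1, 0)
    low = sum(spectrum.get(f, 0) for f in range(2, 5))
    mid = sum(spectrum.get(f, 0) for f in range(5, 11))
    vocab = sum(spectrum.values())
    return [once, low, mid, vocab]

def deep_lexical_features(samples):
    """8 features per sample size: mean and dispersion of each of the 4
    frequency features over the n samples."""
    columns = list(zip(*[frequency_features(s) for s in samples]))
    return [mean(c) for c in columns] + [pstdev(c) for c in columns]

print(frequency_features("to be or not to be that is the question".split()))
# -> [6, 2, 0, 8]: six hapaxes, two words appearing twice, 8 distinct words
```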
<p>[Figure 2: Frequency spectrum (log-log scale) for Charles Dickens' Oliver Twist: (a) total vocabulary, (b) fragment and BoW samples.]</p>
<p>The sampling process is repeated for a number M of sample sizes i, and
the 8 features are calculated for each size. This provides a variable number of final
features depending on the number of sizes selected: the total
number of features N is N = 8M for FS and BS, and N = 16M for BFS.</p>
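As a quick check of the resulting feature-set sizes (our own illustration, not code from the paper):

```python
def total_features(M, sampling):
    """N = 8M for FS or BS alone (8 features per sample size), and
    N = 16M when fragments and bags-of-words are combined (BFS)."""
    per_size = {"FS": 8, "BS": 8, "BFS": 16}[sampling]
    return per_size * M

print(total_features(8, "FS"))   # 8 fragment sizes -> 64 features
print(total_features(2, "BFS"))  # 2 sizes, both sample types -> 32 features
```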
      <p>
Datasets
In order to adjust the parameters of the proposed feature extraction method,
a multilingual corpus of literary works was compiled. Due to the cross-lingual
nature of the experiments, documents in different languages created by the same
author are required. Literary translation is believed to keep the markers from
the original author, while the influence of the translator is weak [
        <xref ref-type="bibr" rid="ref16">16</xref>
], therefore the
corpus used in the experiments is formed by original works by 8 authors and
translated novels from the same 8 authors. It includes two datasets: Dataset
1, a balanced dataset of original documents, and Dataset 2, an unbalanced
extended version including translations. Dataset 1 contains 120 literary texts from
8 different authors (15 documents per author) in 4 different languages (English,
Spanish, German and French), as can be seen in Table 2. Dataset 2 includes all
documents from Dataset 1 plus 85 additional documents which are translations
of literary texts by some of the 8 authors from Dataset 1. A summary of the
translations in Dataset 2 can be found in Table 3. All documents were obtained
from the Project Gutenberg website.
      </p>
<table-wrap id="tab2">
        <label>Table 2</label>
        <caption><p>Dataset 1: languages, authors and average document length (words).</p></caption>
        <table>
          <thead><tr><th>Language</th><th>Author</th><th>Average document length</th></tr></thead>
          <tbody>
            <tr><td>English</td><td>Charles Dickens</td><td>144222</td></tr>
            <tr><td>English</td><td>Ryder Haggard</td><td>97913</td></tr>
            <tr><td>French</td><td>Alexander Dumas</td><td>139681</td></tr>
            <tr><td>French</td><td>Jules Verne</td><td>84124</td></tr>
            <tr><td>German</td><td>J. W. von Goethe</td><td>67671</td></tr>
            <tr><td>German</td><td>F. Gerstacker</td><td>51655</td></tr>
            <tr><td>Spanish</td><td>V. Blasco Ibáñez</td><td>100537</td></tr>
            <tr><td>Spanish</td><td>B. Pérez Galdós</td><td>126034</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-3-2">
<p>Estimating optimum parameter values
The first parameter to be set is n, the number of samples for each sample size
i that is necessary to obtain a representative figure for average and dispersion
values. An empirical study has been performed with 10 to 2000 samples of each
size, using a Random Forest classifier and leave-one-out cross-validation. The
results of the classification using fragments and bags-of-words for Dataset 1 are
shown in Figure 3.</p>
<p>The number of correctly classified documents increases as the number of
samples increases until a stable value is reached. Fragments and bags-of-words
behave differently, with more variation in the bags-of-words samples. Two
threshold levels can be identified in Figure 3: the first threshold is around the value of
200 samples, and the second threshold is around 700 samples, where the results
are more stable. However, as computational time is an important factor in
text analysis, the selected value for n, the number of samples, is fixed at 200
samples per sample size i.</p>
        <p>Optimum sample size or combination of sample sizes. Number
of features.</p>
<p>Once the number of samples is fixed, we need to determine the sample sizes i
that will produce the best performing set of features. For each sample size, the
proposed method produces a set of 8 features. All sample sizes and their
combinations will be empirically tested to evaluate the effect of different numbers
of features on the final classification. For this experiment, the following sample
sizes (fragments and bags-of-words) have been used: 200, 500, 800, 1000, 1500,
2000, 3000 and 4000.</p>
<p>Combinations of 1, 2, 3, 4, 5, 6, 7 and 8 different sample sizes were taken
for both fragment and bag-of-words samples, as well as the combination of both
types of samples. In order to optimize the number of features, the combination
that produces the highest accuracy with the lowest number of features will be
selected. The results, grouped per number of different sample sizes (M) and hence
per total number of features, are shown in Figure 4 for fragments, bags-of-words
and the combination of both, for Datasets 1 and 2.</p>
        <p>[Figure 4: Accuracy per number of combined sample sizes: (a) Dataset 1, (b) Dataset 2.]</p>
<p>The results from the different combinations of sample sizes show different
responses to Dataset 1 and Dataset 2. The different nature of these two datasets
explains the different behaviour of the types of samples for each dataset.
Fragments are more powerful at discriminating between originals in a balanced setting,
whereas bags-of-words perform poorly when each author is represented by
documents in only one language. On the other hand, bags-of-words provide stronger
results for the more difficult problem presented in Dataset 2, where translations
are included in the dataset. In both scenarios, the combination of both types of
samples, BFS, provides the best results.</p>
<p>In terms of the final size of the feature set, which combines the type of sample
and the number of sample sizes i, there is no significant improvement after 2 sizes
are combined. The final size of the feature set is therefore N = 2 x (8 + 8) = 32,
i.e. 8 fragment and 8 bag-of-words features for each of the 2 sizes.
A closer look at the combinations of sizes that produce the best results shows
sizes 500 and 1000 obtaining the highest accuracy.</p>
        <p>
Preliminary evaluation of BFS applied to CLAA using the same cross-validation
method (leave-one-novel-out) and the same dataset as Bogdanova and
Lazaridou [
          <xref ref-type="bibr" rid="ref3">3</xref>
] shows that BFS achieves better classification results (0.47) than high-level
features without the use of machine translation (0.31). In this particular
experiment, 27 documents are used, plus 7 which are translations of one of the 27,
with the final dataset being formed by 275 texts extracted from the 34
original documents. For this reason, leave-one-novel-out is used to avoid the
classifier being trained on texts from the same document (or translations of it).
Every time leave-one-novel-out is performed on this dataset, a large number of
texts is removed from the training data; hence the training set is small, which,
added to the short length of the texts, affects the overall classification
performance. Machine translation methods achieve better results but are limited by
the availability of resources in the given languages, as well as by the requirement
to identify the target language beforehand.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
This paper has presented a feature extraction method for language-independent
vocabulary richness measurements. Traditional vocabulary richness
methods have not performed to state-of-the-art accuracy values in the past and
have been replaced with monolingual features such as word n-grams and
part-of-speech features. In order to work with multilingual corpora, previous research
has used machine translation [
        <xref ref-type="bibr" rid="ref3">3</xref>
] and lexicons or texts available in several
languages, such as Wikipedia [
        <xref ref-type="bibr" rid="ref9">9</xref>
] or EuroVoc documents [
        <xref ref-type="bibr" rid="ref14">14</xref>
]. The proposed method
expands traditional vocabulary richness using two types of samples: fragments
and bags-of-words of fixed size. It calculates local measurements on those samples,
as well as the dispersion of those measurements over the samples. The method
uses solely deep level intrinsic document measurements, and hence no external
resources are needed.
      </p>
<p>Our experiments on cross-lingual authorship attribution show that BFS with
deep lexical features is suitable for discriminating between authors in
multilingual tasks using a relatively small feature set and no external resources. Even
though the accuracy of machine translation based methods is still significantly
higher, the experiments reproduced deal with highly popular languages such as
English and Spanish, and results for low-resource languages are expected to be
lower. In these situations, a method based on intrinsic document features, such
as the one presented in this paper, provides a solution that is not biased by
the amount of external resources available. Further work will focus firstly on
extensive evaluation of the performance of BFS at a variety of cross-lingual tasks,
and secondly on the exploration of deep level features used in combination with
other language-independent (implementation-wise) methods, such as character
n-grams or methods based on punctuation and sentence length measurements.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Ahmed</given-names>
<surname>Shamsul Arefin</surname>
          </string-name>
          , Renato Vimieiro, Carlos Riveros, Hugh Craig, and
          <string-name>
            <given-names>Pablo</given-names>
            <surname>Moscato</surname>
          </string-name>
          .
<article-title>An Information Theoretic Clustering Approach for Unveiling Authorship Affinities in Shakespearean Era Plays and Poems</article-title>
          .
          <source>PLoS ONE</source>
          ,
          <volume>9</volume>
          (
          <issue>10</issue>
          ):e111445,
          <year>October 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Arun</surname>
          </string-name>
          ,
          <string-name>
            <surname>Suresh</surname>
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Murty</surname>
            <given-names>Saradha</given-names>
          </string-name>
          ,
          <string-name>
            <surname>R.</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. E. Veni</given-names>
            <surname>Madhavan</surname>
          </string-name>
          .
          <article-title>Stopwords and stylometry: a latent Dirichlet allocation approach</article-title>
          .
          <source>In NIPS workshop on Applications for Topic Models: Text and Beyond</source>
, pages
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Dasha</given-names>
            <surname>Bogdanova</surname>
          </string-name>
          and
          <string-name>
            <given-names>Angeliki</given-names>
            <surname>Lazaridou</surname>
          </string-name>
          .
          <article-title>Cross-Language Authorship Attribution</article-title>
          .
          <source>In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014)</source>
, Reykjavik, Iceland, May 26-31, 2014, pages
          <fpage>83</fpage>
          -
          <lpage>86</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>John</given-names>
            <surname>Burrows</surname>
          </string-name>
          , David Hoover, David Holmes,
          <string-name>
            <given-names>Joe</given-names>
            <surname>Rudman</surname>
          </string-name>
          , and
<string-name>
            <given-names>Fiona J.</given-names>
            <surname>Tweedie</surname>
          </string-name>
          .
          <article-title>The State of Non-Traditional Authorship Attribution Studies 2010: Some Problems and Solutions</article-title>
          , pages
          <fpage>1</fpage>
          -
          <lpage>3</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Covington</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. D.</given-names>
            <surname>McFall</surname>
          </string-name>
          .
<article-title>Cutting the Gordian Knot: The Moving-Average Type-Token Ratio (MATTR)</article-title>
          .
          <source>Journal of Quantitative Linguistics</source>
          ,
          <volume>17</volume>
          (
          <issue>2</issue>
):
          <fpage>94</fpage>
          -
          <lpage>100</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Michael</given-names>
            <surname>Gamon</surname>
          </string-name>
          and
          <string-name>
            <given-names>Agnes</given-names>
            <surname>Grey</surname>
          </string-name>
          .
<article-title>Linguistic correlates of style: authorship classification with deep linguistic analysis features</article-title>
          .
          <source>Proceedings of the 20th International Conference on Computational Linguistics</source>
          ,
          <volume>4</volume>
          :
          <fpage>611</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>V.</given-names>
            <surname>Keselj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cercone</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Thomas</surname>
          </string-name>
          .
<article-title>N-gram-based author profiles for authorship attribution</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>3</volume>
          :
<fpage>255</fpage>
          -
          <lpage>264</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Miroslav</given-names>
            <surname>Kubat</surname>
          </string-name>
          and
          <string-name>
<given-names>Jiri</given-names>
            <surname>Milicka</surname>
          </string-name>
          .
          <article-title>Vocabulary Richness Measure in Genres</article-title>
          .
          <source>Journal of Quantitative Linguistics</source>
          , pages
<fpage>339</fpage>
          -
          <lpage>349</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
<given-names>Mari-Sanna</given-names>
            <surname>Paukkeri</surname>
          </string-name>
, Ilari T. Nieminen, Matti Polla, and
          <string-name>
            <given-names>Timo</given-names>
            <surname>Honkela</surname>
          </string-name>
          .
          <article-title>A Language-Independent Approach to Keyphrase Extraction and Evaluation</article-title>
. In
          <source>COLING (Posters)</source>
          , pages
          <fpage>83</fpage>
          -
          <lpage>86</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
<given-names>Salvatore</given-names>
            <surname>Romeo</surname>
          </string-name>
          , Dino Ienco, and
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Tagarelli</surname>
          </string-name>
          .
<article-title>Knowledge-Based Representation for Transductive Multilingual Document Classification</article-title>
          .
<source>ECIR 2015</source>
          :
          <fpage>92</fpage>
          -
          <lpage>103</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Jan</given-names>
            <surname>Rybicki</surname>
          </string-name>
          and
          <string-name>
            <given-names>Maciej</given-names>
            <surname>Eder</surname>
          </string-name>
          .
<article-title>Deeper Delta across genres and languages: do we really need the most frequent words?</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          ,
          <volume>26</volume>
          (
          <issue>3</issue>
):
          <fpage>315</fpage>
          -
          <lpage>321</lpage>
          ,
          <year>September 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Efstathios</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          .
          <article-title>A survey of modern authorship attribution methods</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          ,
          <volume>60</volume>
          (
          <issue>3</issue>
):
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          ,
          <year>March 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
<given-names>Efstathios</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          , Nikos Fakotakis, and
          <string-name>
            <given-names>George</given-names>
            <surname>Kokkinakis</surname>
          </string-name>
          .
          <article-title>Automatic Text Categorization in Terms of Genre and Author</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>26</volume>
          :
<fpage>471</fpage>
          -
          <lpage>495</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
<given-names>Ralf</given-names>
            <surname>Steinberger</surname>
          </string-name>
          , Bruno Pouliquen, and
          <string-name>
            <given-names>Johan</given-names>
            <surname>Hagman</surname>
          </string-name>
          .
          <article-title>Cross-lingual document similarity calculation using the multilingual thesaurus EUROVOC</article-title>
          .
          <source>In Computational Linguistics and Intelligent Text Processing</source>
          , Third International Conference, pages
<fpage>415</fpage>
          -
          <lpage>424</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
<given-names>Lauren M.</given-names>
            <surname>Stuart</surname>
          </string-name>
          , Saltanat Tazhibayeva,
          <string-name>
<given-names>Amy R.</given-names>
            <surname>Wagoner</surname>
          </string-name>
          , and
<string-name>
            <given-names>Julia M.</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <article-title>Style features for authors in two languages</article-title>
          .
          <source>Proceedings - 2013 IEEE/WIC/ACM International Conference on Web Intelligence</source>
          ,
          <volume>1</volume>
          :
<fpage>459</fpage>
          -
          <lpage>464</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Lawrence</given-names>
            <surname>Venuti</surname>
          </string-name>
          .
          <article-title>The translator's invisibility: A history of translation</article-title>
          .
          <source>Routledge</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>