<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Authorship Attribution for Serbian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anđelka Zečević</string-name>
          <email>andjelkaz@matf.bg.ac.rs</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>BCI'12, September 16-20, 2012, Novi Sad, Serbia.</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miloš Utvić</string-name>
          <email>misko@matf.bg.ac.rs</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Copyright © 2012 by the paper's authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. Local Proceedings also appeared in ISBN 978-86-7031-200-5, Faculty of Sciences, University of Novi Sad.</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Mathematics</institution>
          ,
          <addr-line>Studentski trg 16, Belgrade</addr-line>
          ,
          <country country="RS">Serbia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Faculty of Philology</institution>
          ,
          <addr-line>Studentski trg 3, Belgrade</addr-line>
          ,
          <country country="RS">Serbia</country>
        </aff>
      </contrib-group>
      <fpage>109</fpage>
      <lpage>112</lpage>
      <abstract>
        <p>Authorship attribution is the problem of identifying the author of an anonymous or disputed text, given a closed set of candidate authors. Because natural languages are rich and individuality can be expressed in writing in numerous ways, this task draws on all sources of language knowledge: lexis, syntax, semantics, orthography, etc. Impressive results of n-gram based algorithms have been reported in many papers and for many languages. The goal of our research was to test whether this group of algorithms works equally well on Serbian and, if so, to determine the optimal values of the parameters appearing in the algorithms. We also wanted to test whether a syllable based word decomposition, which is a more human-like word decomposition than n-grams, can be useful in authorship attribution. Our results confirm the good performance of the n-gram based approach (accuracy up to 96%) and show the potential usefulness of the syllable based approach (accuracy from 81% to 89%).</p>
      </abstract>
      <kwd-group>
        <kwd>Authorship attribution</kwd>
        <kwd>classification</kwd>
        <kwd>n-grams</kwd>
        <kwd>syllables</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        By definition, authorship attribution is the problem
of identifying the author of an anonymous or
disputed text, given a closed set of candidate authors. One
of the first studies concerning this topic was published in
1787 by Edmond Malone [
        <xref ref-type="bibr" rid="ref14">12</xref>
        ], who argued that Shakespeare
did not write some parts of Henry VI. His evidence was
based on analyses of meter and rhyme, which showed
strong disagreement between Shakespeare’s style and the real
author’s style. Probably the most influential study is that of
Mosteller and Wallace [
        <xref ref-type="bibr" rid="ref17">15</xref>
        ] from 1964 on the authorship of The
Federalist Papers, a series of 85 essays written by John Jay,
Alexander Hamilton and James Madison to promote
the ratification of the United States Constitution.
Nowadays, the focus of authorship attribution is on modern text forms
such as e-mail messages [
        <xref ref-type="bibr" rid="ref12">10</xref>
        ], SMS text messages [
        <xref ref-type="bibr" rid="ref16">14</xref>
        ], source
code [
        <xref ref-type="bibr" rid="ref3">1</xref>
        ] or blog posts [
        <xref ref-type="bibr" rid="ref13">11</xref>
        ].
      </p>
      <p>
        All approaches to the authorship attribution problem
are based on the premise that an author’s individuality
affects his or her writing in a unique and recognisable
manner. Stylometry is the field that deals with defining and
analysing relevant text features (so-called style markers) that
can serve as an author’s fingerprint. So far, numerous text
features have been considered [
        <xref ref-type="bibr" rid="ref10 ref21 ref7 ref9">7, 19, 5, 8</xref>
        ]. Some of them
exploit the text surface and take into account average word
length or vocabulary richness, while more complex
ones deal with text semantics or syntax trees. This large
set of features influences the choice of algorithms as well as
of methods for text comparison.
      </p>
      <p>
        From a machine learning point of view, the authorship
attribution problem is treated as a classification task [
        <xref ref-type="bibr" rid="ref19">17</xref>
        ]: a
text of unknown authorship is assigned to one of the authors
from the given set of candidate authors. This treatment
puts at researchers’ disposal all the algorithms developed by the
machine learning community (neural networks, support
vector machines, memory based learning algorithms, Bayesian
learning, etc.) and enables them to present their data and
results in a mathematically well-founded manner.
      </p>
      <p>The remainder of the paper is organized as follows. In
Sections 2 and 3 we introduce byte level n-grams and syllables
as text features. In Section 4 we define two author
profiles: the first is based on n-grams and the second
on syllables. Distance measures for comparing profiles are
introduced in Section 5. In Section 6 we propose the
structure of a profile based approach and discuss important steps
of the algorithm. Measures for estimating the effectiveness
of the classification are presented in Section 7. Section 8
summarizes the obtained results and, finally, Section 9 presents
some conclusions and future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. N-GRAMS</title>
      <p>An n-gram is a contiguous sequence of n bytes, n
characters or n words taken from a longer portion of text. Accordingly,
we distinguish byte level, character level and word level
n-grams. Our focus is on byte level n-grams, whose
representation depends on the character encoding. For instance, if we
consider the standard ASCII encoding and the text fragment abc,
the byte level 2-grams are 01100001 01100010 and 01100010
01100011, where the code values 01100001, 01100010 and
01100011 correspond to the characters a, b and c,
respectively.</p>
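      <p>The ASCII example above can be reproduced with a short sketch. The function names <monospace>byte_ngrams</monospace> and <monospace>to_bits</monospace> are illustrative choices of ours, not part of the described system, and the default encoding is an assumption made only to match the example.</p>
      <preformat><![CDATA[
```python
def byte_ngrams(text, n, encoding="ascii"):
    """Return all byte level n-grams of `text` as a list of bytes objects."""
    data = text.encode(encoding)
    # Slide a window of n bytes over the encoded text, one byte at a time.
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def to_bits(ngram):
    """Render an n-gram as space-separated 8-bit binary codes."""
    return " ".join(format(byte, "08b") for byte in ngram)


if __name__ == "__main__":
    for gram in byte_ngrams("abc", 2):
        print(to_bits(gram))
```
]]></preformat>
      <p>With a multi-byte encoding such as UTF-8, a single Serbian letter with a diacritic occupies two bytes, so the same text would yield different byte level n-grams than it would under a single-byte encoding.</p>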
      <p>The general strengths of the byte level n-gram approach are
language-independent processing and computational
simplicity. Furthermore, for different values of the parameter n,
n-grams make it possible to track lexical, contextual or formatting
information. N-gram approaches are also tolerant of noise
and behave robustly in the presence of different kinds of
textual errors. On the other hand, adjacent n-grams overlap and
contain redundant information, so the memory requirements
are higher than for other methods. If
a portion of text is k bytes long, the number of byte level
n-grams is k + 1 − n, so the total storage required is
(k + 1 − n) · n bytes.</p>
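      <p>As a quick check of the count and storage formulas, the following sketch evaluates them for given k and n. The helper names are hypothetical, and the clamp to zero for texts shorter than n is our own assumption about the degenerate case.</p>
      <preformat><![CDATA[
```python
def ngram_count(k, n):
    """Number of byte level n-grams in a text of k bytes: k + 1 - n."""
    # Assumption: a text shorter than n bytes contains no n-grams.
    return max(k + 1 - n, 0)


def ngram_storage_bytes(k, n):
    """Bytes needed to store every n-gram separately: (k + 1 - n) * n."""
    return ngram_count(k, n) * n


if __name__ == "__main__":
    # The three-byte text "abc" has two 2-grams occupying four bytes.
    print(ngram_count(3, 2), ngram_storage_bytes(3, 2))
```
]]></preformat>
      <p>For example, a text of 1,000 bytes has 996 5-grams, which take 4,980 bytes when stored separately, almost five times the size of the text itself.</p>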
    </sec>
    <sec id="sec-3">
      <title>3. SYLLABLES</title>
      <p>
        A syllable is defined<sup>1</sup> as a unit of pronunciation having one
vowel sound, with or without surrounding consonants, and
forming all or part of a word. Decomposition of words into
syllables is not always straightforward or unique. Generally, every
syllable requires a nucleus. Syllable nuclei in Serbian are
vowels and sonorants such as ‘r’, ‘l’ and ‘n’. Serbian syllables
can be open (if they end with a vowel) or closed (if they
end with a consonant). The boundary between consecutive
syllables in a Serbian word is usually placed after a vowel.
The rules of syllabication in Serbian are based on phonetic
and semantic characteristics [
        <xref ref-type="bibr" rid="ref18">16</xref>
        ].
      </p>
      <p>Although there are software packages and resources
available for automatic syllabication of Serbian (RAS,<sup>2</sup> Hunspell<sup>3</sup>
dictionaries and hyphenation patterns for OpenOffice<sup>4</sup>), in the
first stage of our experiment we used a “naive” algorithm
which sets a syllable boundary after vowels and the sonorant ‘r’.</p>
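      <p>One literal reading of the “naive” rule can be sketched as follows. This is not the authors’ implementation: the function name, the restriction to Latin-script input, the lowercasing, and the decision to attach trailing consonants to the last syllable are all assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
VOWELS = set("aeiou")
# Assumption: every 'r' closes a syllable, whether or not it is syllabic.
BOUNDARY_LETTERS = VOWELS | {"r"}


def naive_syllables(word):
    """Split a Latin-script word by cutting after each vowel and each 'r'."""
    syllables = []
    current = ""
    for ch in word.lower():
        current += ch
        if ch in BOUNDARY_LETTERS:
            syllables.append(current)
            current = ""
    if current:
        # Assumption: leftover consonants join the preceding syllable.
        if syllables:
            syllables[-1] += current
        else:
            syllables.append(current)
    return syllables


if __name__ == "__main__":
    print(naive_syllables("politika"))
```
]]></preformat>
      <p>Under this reading a word such as grad is split into gr and ad, which shows why the rule is called naive: a refinement might treat ‘r’ as a nucleus only when it stands between consonants.</p>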
    </sec>
    <sec id="sec-4">
      <title>4. AN AUTHOR PROFILE</title>
      <p>To study an author’s style we need an operational
representation based on his or her writings. This representation
is called an author profile and consists of selected text
features. The set of features does not need to be
homogeneous, which means that numerous features can be combined
in order to obtain a high-quality representation able to capture
all inter-author style variations. On the other hand, the
set of features should be able to distinguish authors from
one another and should be specific to a particular
author.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1 N-gram Based Profiles</title>
      <p>The first author profile we used treats byte level n-grams as
the most relevant text features. It is defined as a set of pairs</p>
      <p>
        <disp-formula><tex-math>P_A = \{(x_1, f_1), (x_2, f_2), \ldots, (x_M, f_M)\}</tex-math></disp-formula>
where x<sub>i</sub> denotes an n-gram value and f<sub>i</sub> its relative
frequency. The relative frequency is calculated as the
number of occurrences of the n-gram divided by the total
number of n-grams. Pairs in the profile are ordered by
relative frequency, from the highest to the lowest value.
The number of pairs M is called the profile size and is
a very important parameter of n-gram based algorithms.</p>
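      <p>A profile of this form can be built in a few lines. The sketch below is a minimal illustration under our own assumptions: the name <monospace>ngram_profile</monospace> is hypothetical, the profile is returned as an insertion-ordered dictionary, and ties between equally frequent n-grams are broken by order of first occurrence, which the text does not specify.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def ngram_profile(data, n, M):
    """Return the M most frequent byte level n-grams with relative frequencies.

    `data` is a bytes object. The result maps each n-gram to its relative
    frequency, ordered from the most to the least frequent.
    """
    grams = [data[i:i + n] for i in range(len(data) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    # most_common(M) yields (n-gram, count) pairs sorted by descending count.
    return {gram: count / total for gram, count in counts.most_common(M)}


if __name__ == "__main__":
    print(ngram_profile(b"abab", 2, 2))
```
]]></preformat>
      <p>In the setting described in this paper, <monospace>data</monospace> would be the concatenation of all training texts of one author.</p>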
      <p>
        This profile was originally proposed by Keselj et al. [
        <xref ref-type="bibr" rid="ref11">9</xref>
        ] and
has been applied to many languages with great success.
      </p>
    </sec>
    <sec id="sec-6">
      <title>4.2 Syllable Based Profiles</title>
      <p>
        There are a number of papers authored or co-authored by
Wilhelm Fucks [
        <xref ref-type="bibr" rid="ref1 ref4 ref5 ref6">2, 3, 4</xref>
        ] on the role of syllables in the author
identification process. He considered the average number of
syllables per word, the frequency distribution of word lengths measured in
syllables (the number of monosyllabic words, the number
of disyllabic words and so on) and the average distance
between i-syllable words (i ≥ 1). In a later study [
        <xref ref-type="bibr" rid="ref9">7</xref>
        ], it was
concluded that the frequency distribution of syllables per word
discriminates between languages more than between specific authors,
and that the overall distribution of syllable counts
changes from one kind of writing to another.
        <fn><label>1</label><p>http://oxforddictionaries.com/</p></fn>
        <fn><label>2</label><p>http://www.rasprog.com/</p></fn>
        <fn><label>3</label><p>http://hunspell.sourceforge.net/</p></fn>
        <fn><label>4</label><p>http://ooo.matf.bg.ac.rs/dict-sr/</p></fn>
      </p>
      <p>The syllable based profile we used in our research
consists of the most frequent syllables, ranked by their absolute
frequency. The form of the profile is</p>
      <p>
        <disp-formula><tex-math>P_A = \{(s_1, F_1), (s_2, F_2), \ldots, (s_M, F_M)\}</tex-math></disp-formula>
where s<sub>i</sub> denotes a syllable and F<sub>i</sub> its absolute frequency
(the total number of its occurrences). The parameter M again
represents the profile size.</p>
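      <p>The syllable based profile differs from the n-gram based one only in its units and in keeping absolute counts. A minimal sketch, assuming the text has already been split into syllables by some syllabication step (the name <monospace>syllable_profile</monospace> is hypothetical):</p>
      <preformat><![CDATA[
```python
from collections import Counter


def syllable_profile(syllables, M):
    """Return the M most frequent syllables with their absolute frequencies.

    `syllables` is any iterable of strings, one item per syllable occurrence.
    """
    # most_common(M) keeps the M highest counts in descending order.
    return dict(Counter(syllables).most_common(M))


if __name__ == "__main__":
    sample = ["po", "li", "ti", "ka", "po", "li", "po"]
    print(syllable_profile(sample, 2))
```
]]></preformat>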
      <p>The main motivation for using these profiles is
the fact that, for small values of the parameter n, n-grams
are able to represent syllable-like information. These profiles
can also be viewed as variable-length n-gram profiles and
used in cases when the optimal value of the parameter n is
unknown.</p>
    </sec>
    <sec id="sec-7">
      <title>5. DISTANCE MEASURES</title>
    </sec>
    <sec id="sec-8">
      <title>5.1 N-gram Based Profiles</title>
      <p>
        The measure we used to compare the profile <inline-formula><tex-math>P_{A_i}</tex-math></inline-formula> of the
i-th author and the profile <inline-formula><tex-math>P_a</tex-math></inline-formula> of an anonymous or disputed
text is defined by the formula
        <disp-formula><tex-math>d(P_{A_i}, P_a) = \sum_{x \in P_a} \left( \frac{2 \cdot (f_{A_i}(x) - f_a(x))}{f_{A_i}(x) + f_a(x)} \right)^2</tex-math></disp-formula>
where x is a byte-level n-gram and <inline-formula><tex-math>f_{A_i}(x)</tex-math></inline-formula> and <inline-formula><tex-math>f_a(x)</tex-math></inline-formula> are the
relative frequencies of the n-gram x in the author’s profile
and in the profile of the text of unknown authorship,
respectively. This measure was originally proposed by Stamatatos
[
        <xref ref-type="bibr" rid="ref20">18</xref>
        ] and combines the measures proposed
by Keselj et al. [
        <xref ref-type="bibr" rid="ref11">9</xref>
        ],
        <disp-formula><tex-math>d(P_{A_i}, P_a) = \sum_{x \in P_a \cup P_{A_i}} \left( \frac{2 \cdot (f_{A_i}(x) - f_a(x))}{f_{A_i}(x) + f_a(x)} \right)^2</tex-math></disp-formula>
and Frantzeskou et al. [
        <xref ref-type="bibr" rid="ref3">1</xref>
        ]
in order to improve the measures’ tolerance to the class imbalance
problem. The class imbalance problem [
        <xref ref-type="bibr" rid="ref8">6</xref>
        ] appears when at
least one profile is smaller or larger than the others. This
is a very realistic situation in author identification problems,
since there might be only a few text samples for one
candidate author and many more text samples for the other
authors, or vice versa. The measure proposed by Keselj et al.
[
        <xref ref-type="bibr" rid="ref11">9</xref>
        ] favours authors with shorter profiles because the union
of the profiles is taken into account. On the other hand, the
measure proposed by Frantzeskou et al. [
        <xref ref-type="bibr" rid="ref3">1</xref>
        ] favours authors
with longer profiles since the size of the intersection of two
profiles is considered.
      </p>
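      <p>The measure summed over the n-grams of the unknown text can be sketched as follows. Profiles are assumed to be dictionaries mapping an n-gram to its relative frequency, and the name <monospace>profile_distance</monospace> is a hypothetical choice. An n-gram that is missing from the author’s profile is treated as having frequency 0, which is our reading of the formula rather than something stated explicitly.</p>
      <preformat><![CDATA[
```python
def profile_distance(author_profile, text_profile):
    """Sum of squared relative differences over the n-grams of the text profile."""
    total = 0.0
    for gram, f_text in text_profile.items():
        # Assumption: n-grams absent from the author profile have frequency 0.
        f_author = author_profile.get(gram, 0.0)
        # f_text is positive for every profile entry, so the sum is non-zero.
        total += (2.0 * (f_author - f_text) / (f_author + f_text)) ** 2
    return total


if __name__ == "__main__":
    author = {b"ab": 0.6, b"bc": 0.4}
    unknown = {b"ab": 1.0}
    # Swapping the arguments changes the set that is summed over.
    print(profile_distance(author, unknown), profile_distance(unknown, author))
```
]]></preformat>
      <p>In this toy case the first call sums over one n-gram and the second over two, so the two values differ; each n-gram of the summed profile that is missing from the other profile contributes exactly 4.</p>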
      <p>
        The presented measure is actually a pseudo-measure
because it lacks the symmetry property: the arguments <inline-formula><tex-math>P_{A_i}</tex-math></inline-formula> and <inline-formula><tex-math>P_a</tex-math></inline-formula>
cannot be swapped. The results obtained in
experimental testing [
        <xref ref-type="bibr" rid="ref20">18</xref>
        ] are very promising and encourage researchers
to use it in spite of this drawback.
      </p>
    </sec>
    <sec>
      <title>5.2 Syllable Based Profiles</title>
      <p>
        To compare syllable based profiles we used the measure
proposed by Frantzeskou et al. [
        <xref ref-type="bibr" rid="ref3">1</xref>
        ], except that we used syllables
instead of n-grams. The measure
counts the total number of common syllables in the profile
<inline-formula><tex-math>P_{A_i}</tex-math></inline-formula> of the i-th author and the profile <inline-formula><tex-math>P_a</tex-math></inline-formula> of an anonymous or
disputed text.
      </p>
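      <p>Counting the syllables shared by two profiles amounts to taking the size of the intersection of their key sets. The sketch below assumes profiles are dictionaries keyed by syllable, and the name <monospace>common_syllables</monospace> is hypothetical. Because this is a count of shared items, a larger value indicates a closer match; how the count is reconciled with the minimum-value selection used elsewhere in the algorithm is not stated in the text, so the sketch returns the raw count only.</p>
      <preformat><![CDATA[
```python
def common_syllables(author_profile, text_profile):
    """Return the number of syllables that occur in both profiles."""
    # Frequencies are ignored: only membership in each profile matters.
    return len(set(author_profile) & set(text_profile))


if __name__ == "__main__":
    author = {"po": 3, "li": 2, "ka": 1}
    unknown = {"li": 4, "ka": 2, "ti": 1}
    print(common_syllables(author, unknown))
```
]]></preformat>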
    </sec>
    <sec id="sec-9">
      <title>6. PROFILE BASED LEARNING</title>
      <p>The scheme of our algorithm is depicted in Figure 1 and
represents a classical profile-based algorithm.</p>
      <p>Step 1: The training data set consists of undisputed text
samples of the authors. All text samples of an author are
concatenated into one large text file, and then the set of the
M most relevant n-grams or syllables is extracted to
obtain the author profile.</p>
      <p>Step 2: When a text of unknown authorship is to be
classified, the set of its M most relevant n-grams or
syllables is extracted. The values of the parameters M and
n are the same as those used in Step 1.</p>
      <p>Step 3: The profile of the text of unknown authorship is
compared with the profiles of all the authors using the
measures defined in the previous section.</p>
      <p>Step 4: The obtained values are analysed by the system
and the smallest value is picked.</p>
      <p>Step 5: The author we treat as the writer of the unclassified
text is the one whose index corresponds to the index
of the selected value.</p>
      <p>
        Underlying the authorship attribution algorithm
is the k Nearest Neighbour classification algorithm [
        <xref ref-type="bibr" rid="ref15">13</xref>
        ] with
the parameter k set to 1. It belongs to the memory based
classification algorithms and assigns an unclassified instance to
one of the given classes according to the minimum-distance
principle.
      </p>
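      <p>Steps 3 to 5 reduce to a minimum-distance search over the candidate profiles. The following sketch illustrates this under our own assumptions: profiles are dictionaries of relative frequencies, the names <monospace>profile_distance</monospace> and <monospace>attribute</monospace> are hypothetical, and the toy profiles are invented purely for the example.</p>
      <preformat><![CDATA[
```python
def profile_distance(author_profile, text_profile):
    """Sum of squared relative differences over the n-grams of the text profile."""
    total = 0.0
    for gram, f_text in text_profile.items():
        f_author = author_profile.get(gram, 0.0)  # absent n-grams count as 0
        total += (2.0 * (f_author - f_text) / (f_author + f_text)) ** 2
    return total


def attribute(author_profiles, text_profile, distance=profile_distance):
    """Return (best_author, distances) using the minimum-distance rule (1-NN)."""
    if not author_profiles:
        raise ValueError("at least one candidate author is required")
    # Step 3: compare the unknown profile with every author profile.
    distances = {
        author: distance(profile, text_profile)
        for author, profile in author_profiles.items()
    }
    # Steps 4 and 5: pick the smallest value and report the matching author.
    best_author = min(distances, key=distances.get)
    return best_author, distances


if __name__ == "__main__":
    candidates = {
        "author_1": {b"ab": 0.6, b"bc": 0.4},
        "author_2": {b"cd": 1.0},
    }
    print(attribute(candidates, {b"ab": 1.0})[0])
```
]]></preformat>
      <p>Passing the distance as a parameter keeps the selection rule separate from the choice of measure, so a different measure can be swapped in without touching the search.</p>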
    </sec>
    <sec id="sec-10">
      <title>7. CLASSIFICATION EFFECTIVENESS</title>
      <p>
        To estimate the effectiveness [
        <xref ref-type="bibr" rid="ref19">17</xref>
        ] of the classification for a single class C<sub>i</sub>,
we used accuracy:
      </p>
      <p>
        <disp-formula><tex-math>A_i = \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i}</tex-math></disp-formula>
      </p>
      <p>The values TP<sub>i</sub>, TN<sub>i</sub>, FP<sub>i</sub> and FN<sub>i</sub> are taken from the
confusion matrix (Table 1) and represent, respectively, the
number of yes-yes, no-no, yes-no and no-yes labeled instances.</p>
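      <p>The per-class accuracy can be computed directly from paired lists of actual and predicted authors. In the sketch below the function name and the label lists are illustrative assumptions; the four counters follow the usual reading of the confusion matrix, with the predicted label first and the actual label second.</p>
      <preformat><![CDATA[
```python
def class_accuracy(actual, predicted, target):
    """Accuracy for one class: (TP + TN) / (TP + TN + FP + FN)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == target and a == target:
            tp += 1  # predicted yes, actually yes
        elif p != target and a != target:
            tn += 1  # predicted no, actually no
        elif p == target:
            fp += 1  # predicted yes, actually no
        else:
            fn += 1  # predicted no, actually yes
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    actual = ["A", "A", "B", "C"]
    predicted = ["A", "B", "B", "C"]
    print(class_accuracy(actual, predicted, "A"))
```
]]></preformat>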
      <p>We experimented with a set of newspaper articles<fn><label>5</label><p>http://www.danas.rs</p></fn>
written independently by six authors. To ensure that
authorship is the most important discriminating feature among
the authors, the selected articles meet a number of specific
criteria. To avoid changes in an author’s style
over time, all articles by a given author were written in the same
period (within one year). To minimize the influence of topic, we
chose only articles that describe the political situation in
the country. All the texts (newspaper articles) are of the
same genre, too. The number of articles per author and the
total size of the training set are presented in Table 2.</p>
      <table-wrap>
        <label>Table 2</label>
        <caption><title>Authors in Training Set</title></caption>
        <table>
          <thead>
            <tr><th>Author name</th><th>Number of articles</th><th>Train size (in bytes)</th></tr>
          </thead>
          <tbody>
            <tr><td>Safeta Biševac</td><td>20</td><td>103,761</td></tr>
            <tr><td>Zoran Panović</td><td>17</td><td>100,706</td></tr>
            <tr><td>Aleksandar Roknić</td><td>27</td><td>101,809</td></tr>
            <tr><td>Snežana Čongradin</td><td>28</td><td>102,756</td></tr>
            <tr><td>Svetislav Basara</td><td>25</td><td>78,891</td></tr>
            <tr><td>Miloš Vasić</td><td>18</td><td>102,875</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The test set consists of non-overlapping articles and
follows the distribution of the training set. The number of
articles per author and the total size of the test set are
presented in Table 3.</p>
      <table-wrap>
        <label>Table 3</label>
        <caption><title>Authors in Test Set</title></caption>
        <table>
          <thead>
            <tr><th>Author name</th><th>Number of articles</th><th>Test size (in bytes)</th></tr>
          </thead>
          <tbody>
            <tr><td>Safeta Biševac</td><td>10</td><td>82,945</td></tr>
            <tr><td>Zoran Panović</td><td>9</td><td>55,415</td></tr>
            <tr><td>Aleksandar Roknić</td><td>13</td><td>50,193</td></tr>
            <tr><td>Snežana Čongradin</td><td>14</td><td>56,558</td></tr>
            <tr><td>Svetislav Basara</td><td>12</td><td>64,684</td></tr>
            <tr><td>Miloš Vasić</td><td>9</td><td>47,655</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-11">
      <title>8.1 N-gram Based Profiles</title>
      <p>The tested values of the parameter n are in the
interval from 1 to 10, and the tested values for the parameter
M are 20, 100, 500, 1,000, 2,000, 3,000, 4,000 and 5,000.</p>
      <p>The system achieves accuracy over 80% for all n-gram sizes
greater than 3 and profile sizes greater than 500. The best
results are achieved for the parameter n between 4 and 7 and for the
profile size M between 1,000 and 4,000. The best accuracy overall
is 0.96, for n = 5 and M = 3,000.</p>
    </sec>
    <sec id="sec-12">
      <title>8.2 Syllable Based Profiles</title>
      <p>The algorithm was tested for the parameter M with values
from 100 to 1,200 in steps of 100. The values were limited by
the maximal number of syllables per author. The results are
presented in Table 4 in terms of accuracy.</p>
    </sec>
    <sec id="sec-13">
      <title>9. CONCLUSIONS</title>
      <p>This paper presents some insights into the authorship
attribution problem for Serbian. The n-gram based approach
performed well, achieving accuracy from
80% up to 96% for the parameter 4 ≤ n ≤ 7, as did
the syllable based approach, with accuracy between 81% and
89%.</p>
      <p>In the future, both the n-gram based and the syllable based approaches,
combined with a wider set of measures, should be tested
on expanded corpora and a longer list of authors. We also
plan to improve the syllabication phase, since the results of
the syllable based approach are promising.</p>
    </sec>
    <sec id="sec-14">
      <title>10. ACKNOWLEDGMENTS</title>
      <p>This research was supported by the Serbian Ministry of
Education and Science under the grant 178006 (Serbian
Language and its Resources).</p>
    </sec>
    <sec id="sec-15">
      <title>REFERENCES</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref3">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Frantzeskou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gritzalis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chaski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Howald</surname>
          </string-name>
          .
          <article-title>Identifying authorship by byte-level n-grams: The SCAP method</article-title>
          .
          <source>International Journal of Digital Evidence</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Fucks</surname>
          </string-name>
          .
          <article-title>On mathematical analysis of style</article-title>. <source>Biometrika</source>
          , (
          <volume>39</volume>
          ):
          <fpage>122</fpage>
          -
          <lpage>129</lpage>
          ,
          <year>1952</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Fucks</surname>
          </string-name>
          .
          <article-title>On nahordnung and fernordnung in samples of literary texts</article-title>
          . Biometrika, (
          <volume>41</volume>
          ):
          <fpage>116</fpage>
          -
          <lpage>132</lpage>
          ,
          <year>1954</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Fucks</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lauter</surname>
          </string-name>
          .
          <article-title>Mathematische analyse des literarischen stils</article-title>
          .
          <source>Mathematik und Dichtung</source>
          , pages
          <fpage>107</fpage>
          -
          <lpage>123</lpage>
          ,
          <year>1965</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Grieve</surname>
          </string-name>
          .
          <article-title>Quantitative authorship attribution: An evaluation of techniques</article-title>
          .
          <source>Literary and Linguistics Computing</source>
          ,
          <volume>22</volume>
          (
          <issue>3</issue>
          ):
          <fpage>251</fpage>
          -
          <lpage>270</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>On the class imbalance problem</article-title>
          .
          <source>In Proc. of Fourth Int. Conf. on Natural Computation</source>
          , pages
          <fpage>192</fpage>
          -
          <lpage>201</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Holmes</surname>
          </string-name>
          .
          <article-title>Authorship attribution</article-title>
          .
          <source>In Computer and the Humanities</source>
          , volume
          <volume>28</volume>
          , pages
          <fpage>87</fpage>
          -
          <lpage>106</lpage>
          .
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Juola</surname>
          </string-name>
          .
          <article-title>Authorship attribution</article-title>
          .
          <source>Foundation and Trends in Information Retrieval</source>
          ,
          <volume>1</volume>
          (
          <issue>3</issue>
          ):
          <fpage>233</fpage>
          -
          <lpage>334</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Keselj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cercone</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Thomas</surname>
          </string-name>
          .
          <article-title>N-gram-based author profiles for authorship attribution</article-title>
          .
          <source>In Proceedings of the Pacific Association for Computer Linguistics</source>
          , pages
          <fpage>255</fpage>
          -
          <lpage>264</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schler</surname>
          </string-name>
          .
          <article-title>Exploiting stylistic idiosyncrasies for authorship attribution</article-title>
          .
          <source>In Proceedings of IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis</source>
          , pages
          <fpage>69</fpage>
          -
          <lpage>72</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Argamon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Messeri</surname>
          </string-name>
          .
          <article-title>Authorship attribution with thousands of candidate authors</article-title>
          . pages
          <fpage>659</fpage>
          -
          <lpage>660</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Malone</surname>
          </string-name>
          .
          <article-title>A dissertation on parts one, two and three of Henry the Sixth tending to show that those playings were not written originally by Shakespeare</article-title>
          ,
          <year>1787</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          . Machine Learning.
          <source>McGraw-Hill</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohan</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Baggili</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Rogers</surname>
          </string-name>
          .
          <article-title>Authorship attribution of SMS messages using an n-grams approach</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Mosteller</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Wallace</surname>
          </string-name>
          .
          <article-title>Inference and disputed authorship: The Federalist</article-title>
          .
          <year>1964</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [16]
          <string-name>
            <surname>M. Pešikan</surname>
          </string-name>
          , J. Jerković, and M. Pižurica.
          <article-title>Pravopis srpskoga jezika</article-title>
          .
          <source>Matica srpska</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <article-title>Machine learning in automated text categorization</article-title>. <source>ACM Computing Surveys</source>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          .
          <article-title>Author identification using imbalanced and limited training texts</article-title>
          .
          <source>In Proceedings of the 18th International Conference on Database and Expert Systems Applications</source>
          , pages
          <fpage>237</fpage>
          -
          <lpage>241</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          .
          <article-title>A survey of modern authorship attribution methods</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          ,
          <volume>60</volume>
          (
          <issue>3</issue>
          ):
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>