An Authorship Attribution for Serbian

                               And̄elka Zečević                                                         Miloš Utvić
                            Faculty of Mathematics                                                     Faculty of Philology
                               Studentski trg 16                                                         Studentski trg 3
                               Belgrade, Serbia                                                         Belgrade, Serbia
                       andjelkaz@matf.bg.ac.rs                                                     misko@matf.bg.ac.rs


ABSTRACT                                                                                pacts on his or her writing in a unique and recognisable
An authorship attribution is a problem of identifying the                               manner. Stylometry is a field that deals with defining and
author of an anonymous or disputed text if there is a closed                            analysing relevant text features (so called style markers) that
set of candidate authors. Due to the richness of natural                                can serve as an author’s fingerprint. So far, numerous text
languages and numerous ways of expressing individuality in                              features have been considered [7, 19, 5, 8]. Some of them
a writing process, this task employs all the sources of lan-                            exploit text surface and take into account an average word
guage knowledge: lexis, syntax, semantics, orthography, etc.                            length or vocabulary richness while there are more complex
The impressive results of n-gram based algorithms have been                             ones dealing with text semantics or syntax trees. This large
presented in many papers for many languages so far. The                                 set of features influences the choice of algorithms as well as
goal of our research was to test if this group of algorithms                            methods for a text comparison.
works equally well on Serbian and if it is a case, to cal-                                 From machine learning point of view, an authorship attri-
culate the optimal values for the parameters appearing in                               bution problem is considered as a classification task [17]: a
the algorithms. Also, we wanted to test if a syllable based                             text of unknown authorship is assigned to one of the authors
word decomposition, which represents a more human like                                  from the given set of candidate authors. This treatment
word decomposition in comparison to n-grams, can be use-                                put at researchers’ disposal all algorithms developed by the
ful in an authorship attribution. Our results confirm good                              machine learning community (neural networks, support vec-
performance of an n-gram based approach (accuracy up to                                 tor machines, memory based learning algorithms, Bayesian
96%) and show the potential usefulness of a syllable based                              learning, etc.) and enables them to present their data and
approach (accuracy from 81% to 89%).                                                    results in a mathematically well founded manner.
                                                                                           The remainder of the paper is organized as follows. In Sec-
                                                                                        tions 2 and 3 we introduce byte level n-grams and syllables
Keywords                                                                                as text features. In Section 4 we define two author pro-
Authorship attribution, classification, n-grams, syllables                              files: first one is based on n-grams and the second is based
                                                                                        on syllables. Distance measures for comparing profiles are
                                                                                        introduced in Section 5. In section 6 we propose the struc-
1.    INTRODUCTION                                                                      ture of a profile based approach and discuss important steps
   By definition, an authorship attribution is a problem re-                            of the algorithm. Measures for estimating the effectiveness
lated to identifying the author of an anonymous or dis-                                 of the classification are presented in Section 7. Section 8
puted text if there is a closed set of candidate authors. One                           summarizes obtained results, and finally, Section 9 presents
of the first studies concerning this topic was published in                             some conclusions and future directions.
1787 by Edmond Malone [12] who argued that Shakespeare
did not write some parts of Henry VI. His evidences was                                 2. N-GRAMS
based on the analyses of meter and rhyme and there was                                     An n-gram is a continuous sequence of n bytes or n char-
highly disagreement between Shakespeare’s and the real au-                              acters or n words of a longer portion of a text. Therefore,
thor’s style. Probably the most influential study is done by                            we distinguish byte level, character level and word level n-
Mosteller and Wallace [15] in 1964 on the authorship of The                             grams. Our focus is on byte level n-grams which representa-
federalist papers, a series of 85 essays written by John Jay,                           tion depends on character encoding. For instance, if we con-
Alexander Hamilton and James Madison on promotion of                                    sider standard ASCII encoding and a portion of a text abc,
the ratification of the United States Constitution. Nowa-                               all byte level 2-grams are 01100001 01100010 and 01100010
days, a focus of an authorship is put on modern text forms                              01100011 where the code values 01100001, 01100010 and
such as e-mail messages [10], SMS text messages [14], source                            01100011 correspond to the characters a, b and c respec-
codes [1] or blog posts [11].                                                           tively.
   All the approaches to an authorship attribution problem                                 The general strengths of a byte level n-gram approach are
are based on the fact that the author’s individuality im-                               a language independent processing and a computational sim-
                                                                                        plicity. Further more, for different values of a parameter n,
BCI’12, September 16–20, 2012, Novi Sad, Serbia.                                        n-grams afford tracking of lexical, contextual or formatting
Copyright c 2012 by the paper’s authors. Copying permitted only for private and
                                                                                        information. N-gram approaches are tolerant of noise too,
academic purposes. This volume is published and copyrighted by its editors.
Local Proceedings also appeared in ISBN 978-86-7031-200-5, Faculty of Sciences,
                                                                                        and behave more robustly in presence of different kind of tex-
University of Novi Sad.                                                                 tual errors. On the other side, adjacent n-grams overlap and


                                                                                  109
contain redundant information so the memory requirements                of disyllabic words and so on) and the average distance be-
are more intensive in comparison to the other methods. If               tween i-syllable words (i ≥ 1). In a later studies [7], it is
a portion of a text is k bytes long, the number of byte level           concluded that frequency distribution of syllables per word
n-grams is k + 1 − n, so the total size of storing memory is            discriminates different languages more than specific authors
(k + 1 − n) · n.                                                        as well as that the overall distribution of syllable counts
                                                                        changes from one kind of writing to another.
                                                                           A profile based on syllables we used in our research con-
3.   SYLLABLES                                                          sists of most frequent syllables in respect to their absolute
   A syllable is defined1 as a unit of pronunciation having one         frequency. The form of the profile is
vowel sound, with or without surrounding consonants, and
forming all or part of a word. Decomposition of words into                         PA = {(s1 , F1 ), (s2 , F2 ), . . . (sM , FM )}
syllables is not always easy and unique. Generally, every
syllable requires a nucleus. Syllable nuclei in Serbian are             where si denotes a syllable and Fi its absolute frequency
vowels and sonorants like ‘r’, ‘l’ and ‘n’. Serbian syllables           (the total number of its occurrences). A parameter M still
can be open (if they end with a vowel) or closed (if they               represents a profile size.
end with a consonant). The boundary between subsequent                    The main motivation for the use of these profiles relies
syllables in a word in Serbian is usually placed after a vowel.         on the fact that for small values of the parameter n n-grams
The rules of syllabication in Serbian are based on phonetic             are able to represent syllable-like information. These profiles
and semantic characteristics [16].                                      can be also observed as variable-length n-gram profiles and
   Although there are software packages and resources avail-            used in cases when the optimal value of the parameter n is
able for automatic syllabication of Serbian (RAS,2 Hunspell3            unknown.
dictionaries and hyphen patterns for OpenOffice,4 ) in the
first stage of our experiment we used a “naive” algorithm
which sets syllable boundary after vowels and sonorant ‘r’.
                                                                        5. DISTANCE MEASURES
                                                                        5.1 N-gram Based Profiles
4.   AN AUTHOR PROFILE                                                     The measure we used to compare the profile PAi of the
   To study an author’s style we require some operative rep-            i-th author and the profile Pa of an anonymous or disputed
resentation based on his or her writings. This representation           text is defined by formula
is called an author profile and consists of selected text fea-
tures. The set of the features does not need to be homoge-                                       X  2 · (fA (x) − fa (x)) 2
                                                                                                             i
neous, which means that numerous features can be combined                        d(PAi , Pa ) =
                                                                                                x∈P
                                                                                                        fAi (x) + fa (x)
in order to obtain qualitative representation able to capture                                         a

all inter-author style variations. On the other hand, the               where x is a byte-level n-gram and fAi (x) and fa (x) are the
set of features should be able to distinguish authors among             relative frequencies of the n-gram x in the author’s profile
themselves and should be something specific for a concrete              and the profile of the text of unknown authorship respec-
author.                                                                 tively. This measure is originally proposed by Stamatatos
                                                                        [18] and represents the combination of measures proposed
4.1 N-gram Based Profiles                                               by Keselj et al. [9]
 First author profile we used treats byte level n-grams as
most relevant text features. It is defined as a set of pairs                                     X  2 · (fA (x) − fa (x)) 2
                                                                                                               i
                                                                               d(PAi , Pa ) =
                                                                                                          fAi (x) + fa (x)
           PA = {(x1 , f1 ), (x2 , f2 ), . . . (xM , fM )}                                    x∈P ∪P
                                                                                                  a       Ai


where xi denotes an n-gram value and fi its relative fre-               and Frantzeskou et al. [1]
quency. The relative frequency is calculated as the total
number of the n-gram occurrences divided by the total num-                                 d(PAi , Pa ) = |PAi ∩ Pa |
ber of n-grams. Pairs in the profile are ordered in respect to
a relative frequency: from the highest to the lowest values.            in order to improve measures’ tolerance to a class imbalance
The number of pairs M is called a profile size and represents           problem. The class imbalance problem [6] appears when at
a very important parameter of n-gram based algorithms.                  least one profile is smaller or larger than the others. This
  This profile is originally proposed by Keselj et al. [9] and          is a very realistic situation in author identification problems
has been applied on many languages with great success.                  since there might be only a few text samples for one candi-
                                                                        date author and many more text samples for the other au-
4.2 Syllable Based Profiles                                             thors, or vice versa. The measure proposed by Keselj et al.
  There is a number of papers authored or co-authored by                [9] favours authors with shorter profiles because the union
Wilhelm Fucks [2, 3, 4] on a syllables’ role in an author               of the profiles is taken into account. On the other hand, the
identification process. He considered an average number of              measure proposed by Frantzeskou et al. [1] favours authors
syllables per word, a word length frequency distribution in             with longer profiles since the size of the intersection of two
syllables (the number of monosyllabic words, the number                 profiles is considered.
                                                                           The presented measure is actually a pseudo measure be-
1
  http://oxforddictionaries.com/                                        cause it leaks a symmetry property - the values PAi and Pa
2
  http://www.rasprog.com/                                               cannot be switched. The results obtained in an experimen-
3
  http://hunspell.sourceforge.net/                                      tal testing [18] are very promising and encourage researchers
4
  http://ooo.matf.bg.ac.rs/dict-sr/                                     to manipulate with it in spite of its drawback.


                                                                  110
5.2 Syllable Based Profiles                                          Step 5: The author we treat as the writer of the unclassified
  For comparing syllable based profiles we used measure                  text is the one who’s index corresponds to the index
proposed by Frantzeskou et al. [1] except we used syllables              of the selected value.
instead of n-grams. The measure
                                                                        In the background of the authorship attribution algorithm
                  d(PAi , Pa ) = |PAi ∩ Pa |                         is a k Nearest Neighbour classification algorithm [13] with
counts the total number of common syllables in the profile           the parameter k set to 1. It represents memory based clas-
PAi of i-th author and the profile Pa of an anonymous or             sification algorithms and assigns an unclassified instance to
disputed text.                                                       one of the given classes according to minimum-distance prin-
                                                                     ciple.
6.   PROFILE BASED LEARNING
                                                                     7. CLASSIFICATION EFFECTIVENESS
  The scheme of our algorithm is depicted in Figure 1 and
represents a classical profile-based algorithm.                         For estimating the effectiveness [17] of a single class Ci
                                                                     classification we have used accuracy
                                                                                                  T Pi + T Ni
                                                                                    Ai =
                                                                                           T Pi + T Ni + F Pi + F Ni
                                                                        Values T Pi , T Ni , F Pi and F Ni are values from a confu-
                                                                     sion matrix (Table 1) and represent, respectively, the num-
                                                                     ber of yes-yes, no-no, yes-no and no-yes labeled instances.


                                                                                   Table 1: A Confusion Matrix
                                                                                                      actual class
                                                                                        class Ci
                                                                                                       yes    no
                                                                                                  yes T Pi F Pi
                                                                                  predicted class
                                                                                                  no  F Ni T Ni

                                                                       For overall estimation of effectiveness we used macroaver-
                                                                     age of individual values:
                                                                                                  Pc
                                                                                                     i=1 Ai
                                                                                            A=
                                                                                                      c
                                                                     where c denotes the total number of classes.
                                                                       The choice of a measure was strongly influenced by the
                                                                     current state-of-the-art results which are presented in re-
                                                                     spect to accuracy. In order to make our results comparable,
                                                                     we have chosen the same measure.


                                                                     8. RESULTS
                                                                        We experimented with a set of newspapers articles5 writ-
          Figure 1: The algorithm scheme.                            ten independently by six authors. In order to achieve the au-
                                                                     thorship is the most important discriminatory feature among
                                                                     the authors, the selected articles meet a number of specific
                                                                     criteria. For the purpose of avoiding an author’s style change
Step 1: The training data set consists of undisputed text            over time, all articles per author are written in the same pe-
    samples of authors. All text samples per author are              riod (within one year). To minimize the topic influence, we
    concatenated in one large text file and then the set of          have only chosen articles that describe political situation in
    M most relevant n-grams or syllables is extracted to             the country. All the texts (newspaper articles) are of the
    obtain the author profile.                                       same genre, too. The number of articles per author and the
Step 2: When a text of unknown authorship should be clas-            total size of the training set is presented in Table 2.
    sified, the set of its M most relevant n-grams or sylla-            The test set consists of non-overlapping articles and fol-
    bles is extracted. The values of the parameters M and            lows the distribution of the training set. The number of
    n are the same as the values used in Step 1.                     articles per author and the total size of the test set is pre-
                                                                     sented in Table 3.
Step 3: The profile of the text of unknown authorship is
    compared to the all authors’ profiles in respect to the          8.1 N-gram Based Profiles
    measures defined in the previous section.                          The tested values of the parameter n are in the inter-
                                                                     val from 1 to 10 and the tested values for the parameter
Step 4: The obtained values are analysed by the system
                                                                     5
    and the smallest value is picked.                                    http://www.danas.rs


                                                               111
                                                                      11. REFERENCES
          Table 2: Authors in Training Set
            Author name       number of train size                     [1] G. Frantzeskou, E. Stamatatos, S. Gritzalis,
                               articles (in bytes)
                                                                           C. Chaski, and B. Howald. Identifying authorship by
     A1     Safeta Biševac      20      103,761                           byte-level n-grams: The SCAP method. International
     A2     Zoran Panović       17      100,706                           Journal of Digital Evidence, 6(1), 2007.
     A3   Aleksandar Roknić     27      101,809                       [2] W. Fucks. On mathematical analysis of style.
     A4   Snežana Čongradin    28      102,756                           Biometrika, (39):122–129, 1952.
     A5    Svetislav Basara      25       78,891                       [3] W. Fucks. On nahordnung and fernordnung in samples
     A5       Miloš Vasić      18      102,875                           of literary texts. Biometrika, (41):116–132, 1954.
                                                                       [4] W. Fucks and J. Lauter. Mathematische analyse des
                                                                           literarischen stils. Mathematik und Dichtung, pages
            Table 3: Authors in Test Set
            Author name       number of  test size                         107–123, 1965.
                               articles (in bytes)
                                                                       [5] J. Grieve. Quantitative authorship attribution: An
     A1     Safeta Biševac      10       82,945
                                                                           evaluation of techniques. Literary and Linguistics
     A2     Zoran Panović        9       55,415
                                                                           Computing, 22(3):251–270, 2007.
     A3   Aleksandar Roknić     13       50,193
                                                                       [6] X. Guo, Y. Yin, C. Dong, G. Yang, and G. Zhou. On
     A4   Snežana Čongradin    14       56,558                           the class imbalance problem. In Proc. of Fourth Int.
     A5    Svetislav Basara      12       64,684                           Conf. on Natural Computation, pages 192–201, 2008.
     A5       Miloš Vasić       9       47,655                       [7] D. Holmes. Authorship attribution. In Computer and
                                                                           the Humanities, volume 28, pages 87–106. 1994.
                                                                       [8] P. Juola. Authorship attribution. Foundation and
M are 20, 100, 500, 1,000, 2,000, 3,000, 4,000 and 5,000.
                                                                           Trends in Information Retrieval, 1(3):233–334, 2006.
The system achieves accuracy over 80% for all n-gram sizes
greater then 3 and the profile sizes greater then 500. The             [9] V. Keselj, F. Peng, N. Cercone, and C. Thomas.
best achieved results are for the parameter n between 4 and 7              N-gram-based author profiles for authorship
and for the profile size M between 1,000 and 4,000. The best               attribution. In Proceedings of the Pacific Association
achieved accuracy at all is 0.96 for n = 5 and M = 3, 000.                 for Computer Linguistics, pages 255–264, 2003.
                                                                      [10] M. Koppel and J. Schler. Exploiting stylistic
8.2 Syllable Based Profiles                                                idiosyncrasies for authorship attribution. In
   The algorithm is tested for the parameter M with values                 Proceedings of IJCAI’03 Workshop on Computational
from 100 to 1,200 by step 100. The values were limited by                  Approaches to Style Analysis and Synthesis, pages
the maximal number of syllables per author. The results are                69–72, 2003.
presented in Table 4 in respect to accuracy.                          [11] M. Koppel, J. Schler, S. Argamon, and E. Messeri.
                                                                           Authorship attribution with thousands of candidate
                                                                           authors. pages 659–660, 2006.
  Table 4: The Results for Syllable Based Profiles                    [12] E. Malone. A dissertation on parts one, two and three
           M   accuracy    M     accuracy                                  of Henry the Sixth tending to show that those playings
           100   0.81      700     0.86                                    were not written originally by Shakespeare, 1787.
           200   0.85      800     0.84                               [13] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
           300   0.88      900     0.87                               [14] A. Mohan, I. Baggili, and M. Rogers. Authorship
           400   0.88     1,000    0.85                                    attribution of SMS messages using an n-grams
           500   0.89     1,100    0.85                                    approach. Technical report, 2010.
           600   0.89     1,200    0.86                               [15] F. Mosteller and D. Wallace. Inference and disputed
                                                                           authorship: The Federalist. 1964.
                                                                      [16] M. Pešikan, J. Jerković, and M. Pižurica. Pravopis
9.   CONCLUSIONS                                                           srpskoga jezika. Matica srpska, 2009.
   This paper presents some insights into an authorship at-           [17] F. Sebastiani. Machine learning in automated text
tribution problem for Serbian. The n-gram based approach                   categorization. ACM Computing Surveys, 34(1):1–47,
proved its good performance and achieved accuracy from                     2002.
80% up to 96% for the parameter 4 ≤ n ≤ 7, as well as                 [18] E. Stamatatos. Author identification using imbalanced
the syllable based approach with accuracy between 81% and                  and limited training texts. In Proceedings of the 18th
89%.                                                                       International Conference on Database and Expert
   In the future, both n-gram based and syllable approaches,               Systems Applications, pages 237–241, 2007.
combined with the wider set of measures, should be tested             [19] E. Stamatatos. A survey of modern authorship
on expanded corpora and longer list of authors. We also                    attribution methods. Journal of the American Society
plan to improve a syllabication phase since the results of                 for Information Science and Technology,
syllable based approach are promising.                                     60(3):538–556, 2009.

10. ACKNOWLEDGMENTS
  This research was supported by the Serbian Ministry of
Education and Science under the grant 178006 (Serbian Lan-
guage and its Resources).


                                                                112