An Authorship Attribution for Serbian And̄elka Zečević Miloš Utvić Faculty of Mathematics Faculty of Philology Studentski trg 16 Studentski trg 3 Belgrade, Serbia Belgrade, Serbia andjelkaz@matf.bg.ac.rs misko@matf.bg.ac.rs ABSTRACT pacts on his or her writing in a unique and recognisable An authorship attribution is a problem of identifying the manner. Stylometry is a field that deals with defining and author of an anonymous or disputed text if there is a closed analysing relevant text features (so called style markers) that set of candidate authors. Due to the richness of natural can serve as an author’s fingerprint. So far, numerous text languages and numerous ways of expressing individuality in features have been considered [7, 19, 5, 8]. Some of them a writing process, this task employs all the sources of lan- exploit text surface and take into account an average word guage knowledge: lexis, syntax, semantics, orthography, etc. length or vocabulary richness while there are more complex The impressive results of n-gram based algorithms have been ones dealing with text semantics or syntax trees. This large presented in many papers for many languages so far. The set of features influences the choice of algorithms as well as goal of our research was to test if this group of algorithms methods for a text comparison. works equally well on Serbian and if it is a case, to cal- From machine learning point of view, an authorship attri- culate the optimal values for the parameters appearing in bution problem is considered as a classification task [17]: a the algorithms. Also, we wanted to test if a syllable based text of unknown authorship is assigned to one of the authors word decomposition, which represents a more human like from the given set of candidate authors. This treatment word decomposition in comparison to n-grams, can be use- put at researchers’ disposal all algorithms developed by the ful in an authorship attribution. Our results confirm good machine learning community (neural networks, support vec- performance of an n-gram based approach (accuracy up to tor machines, memory based learning algorithms, Bayesian 96%) and show the potential usefulness of a syllable based learning, etc.) and enables them to present their data and approach (accuracy from 81% to 89%). results in a mathematically well founded manner. The remainder of the paper is organized as follows. In Sec- tions 2 and 3 we introduce byte level n-grams and syllables Keywords as text features. In Section 4 we define two author pro- Authorship attribution, classification, n-grams, syllables files: first one is based on n-grams and the second is based on syllables. Distance measures for comparing profiles are introduced in Section 5. In section 6 we propose the struc- 1. INTRODUCTION ture of a profile based approach and discuss important steps By definition, an authorship attribution is a problem re- of the algorithm. Measures for estimating the effectiveness lated to identifying the author of an anonymous or dis- of the classification are presented in Section 7. Section 8 puted text if there is a closed set of candidate authors. One summarizes obtained results, and finally, Section 9 presents of the first studies concerning this topic was published in some conclusions and future directions. 1787 by Edmond Malone [12] who argued that Shakespeare did not write some parts of Henry VI. His evidences was 2. N-GRAMS based on the analyses of meter and rhyme and there was An n-gram is a continuous sequence of n bytes or n char- highly disagreement between Shakespeare’s and the real au- acters or n words of a longer portion of a text. Therefore, thor’s style. Probably the most influential study is done by we distinguish byte level, character level and word level n- Mosteller and Wallace [15] in 1964 on the authorship of The grams. Our focus is on byte level n-grams which representa- federalist papers, a series of 85 essays written by John Jay, tion depends on character encoding. For instance, if we con- Alexander Hamilton and James Madison on promotion of sider standard ASCII encoding and a portion of a text abc, the ratification of the United States Constitution. Nowa- all byte level 2-grams are 01100001 01100010 and 01100010 days, a focus of an authorship is put on modern text forms 01100011 where the code values 01100001, 01100010 and such as e-mail messages [10], SMS text messages [14], source 01100011 correspond to the characters a, b and c respec- codes [1] or blog posts [11]. tively. All the approaches to an authorship attribution problem The general strengths of a byte level n-gram approach are are based on the fact that the author’s individuality im- a language independent processing and a computational sim- plicity. Further more, for different values of a parameter n, BCI’12, September 16–20, 2012, Novi Sad, Serbia. n-grams afford tracking of lexical, contextual or formatting Copyright c 2012 by the paper’s authors. Copying permitted only for private and information. N-gram approaches are tolerant of noise too, academic purposes. This volume is published and copyrighted by its editors. Local Proceedings also appeared in ISBN 978-86-7031-200-5, Faculty of Sciences, and behave more robustly in presence of different kind of tex- University of Novi Sad. tual errors. On the other side, adjacent n-grams overlap and 109 contain redundant information so the memory requirements of disyllabic words and so on) and the average distance be- are more intensive in comparison to the other methods. If tween i-syllable words (i ≥ 1). In a later studies [7], it is a portion of a text is k bytes long, the number of byte level concluded that frequency distribution of syllables per word n-grams is k + 1 − n, so the total size of storing memory is discriminates different languages more than specific authors (k + 1 − n) · n. as well as that the overall distribution of syllable counts changes from one kind of writing to another. A profile based on syllables we used in our research con- 3. SYLLABLES sists of most frequent syllables in respect to their absolute A syllable is defined1 as a unit of pronunciation having one frequency. The form of the profile is vowel sound, with or without surrounding consonants, and forming all or part of a word. Decomposition of words into PA = {(s1 , F1 ), (s2 , F2 ), . . . (sM , FM )} syllables is not always easy and unique. Generally, every syllable requires a nucleus. Syllable nuclei in Serbian are where si denotes a syllable and Fi its absolute frequency vowels and sonorants like ‘r’, ‘l’ and ‘n’. Serbian syllables (the total number of its occurrences). A parameter M still can be open (if they end with a vowel) or closed (if they represents a profile size. end with a consonant). The boundary between subsequent The main motivation for the use of these profiles relies syllables in a word in Serbian is usually placed after a vowel. on the fact that for small values of the parameter n n-grams The rules of syllabication in Serbian are based on phonetic are able to represent syllable-like information. These profiles and semantic characteristics [16]. can be also observed as variable-length n-gram profiles and Although there are software packages and resources avail- used in cases when the optimal value of the parameter n is able for automatic syllabication of Serbian (RAS,2 Hunspell3 unknown. dictionaries and hyphen patterns for OpenOffice,4 ) in the first stage of our experiment we used a “naive” algorithm which sets syllable boundary after vowels and sonorant ‘r’. 5. DISTANCE MEASURES 5.1 N-gram Based Profiles 4. AN AUTHOR PROFILE The measure we used to compare the profile PAi of the To study an author’s style we require some operative rep- i-th author and the profile Pa of an anonymous or disputed resentation based on his or her writings. This representation text is defined by formula is called an author profile and consists of selected text fea- tures. The set of the features does not need to be homoge- X  2 · (fA (x) − fa (x)) 2 i neous, which means that numerous features can be combined d(PAi , Pa ) = x∈P fAi (x) + fa (x) in order to obtain qualitative representation able to capture a all inter-author style variations. On the other hand, the where x is a byte-level n-gram and fAi (x) and fa (x) are the set of features should be able to distinguish authors among relative frequencies of the n-gram x in the author’s profile themselves and should be something specific for a concrete and the profile of the text of unknown authorship respec- author. tively. This measure is originally proposed by Stamatatos [18] and represents the combination of measures proposed 4.1 N-gram Based Profiles by Keselj et al. [9] First author profile we used treats byte level n-grams as most relevant text features. It is defined as a set of pairs X  2 · (fA (x) − fa (x)) 2 i d(PAi , Pa ) = fAi (x) + fa (x) PA = {(x1 , f1 ), (x2 , f2 ), . . . (xM , fM )} x∈P ∪P a Ai where xi denotes an n-gram value and fi its relative fre- and Frantzeskou et al. [1] quency. The relative frequency is calculated as the total number of the n-gram occurrences divided by the total num- d(PAi , Pa ) = |PAi ∩ Pa | ber of n-grams. Pairs in the profile are ordered in respect to a relative frequency: from the highest to the lowest values. in order to improve measures’ tolerance to a class imbalance The number of pairs M is called a profile size and represents problem. The class imbalance problem [6] appears when at a very important parameter of n-gram based algorithms. least one profile is smaller or larger than the others. This This profile is originally proposed by Keselj et al. [9] and is a very realistic situation in author identification problems has been applied on many languages with great success. since there might be only a few text samples for one candi- date author and many more text samples for the other au- 4.2 Syllable Based Profiles thors, or vice versa. The measure proposed by Keselj et al. There is a number of papers authored or co-authored by [9] favours authors with shorter profiles because the union Wilhelm Fucks [2, 3, 4] on a syllables’ role in an author of the profiles is taken into account. On the other hand, the identification process. He considered an average number of measure proposed by Frantzeskou et al. [1] favours authors syllables per word, a word length frequency distribution in with longer profiles since the size of the intersection of two syllables (the number of monosyllabic words, the number profiles is considered. The presented measure is actually a pseudo measure be- 1 http://oxforddictionaries.com/ cause it leaks a symmetry property - the values PAi and Pa 2 http://www.rasprog.com/ cannot be switched. The results obtained in an experimen- 3 http://hunspell.sourceforge.net/ tal testing [18] are very promising and encourage researchers 4 http://ooo.matf.bg.ac.rs/dict-sr/ to manipulate with it in spite of its drawback. 110 5.2 Syllable Based Profiles Step 5: The author we treat as the writer of the unclassified For comparing syllable based profiles we used measure text is the one who’s index corresponds to the index proposed by Frantzeskou et al. [1] except we used syllables of the selected value. instead of n-grams. The measure In the background of the authorship attribution algorithm d(PAi , Pa ) = |PAi ∩ Pa | is a k Nearest Neighbour classification algorithm [13] with counts the total number of common syllables in the profile the parameter k set to 1. It represents memory based clas- PAi of i-th author and the profile Pa of an anonymous or sification algorithms and assigns an unclassified instance to disputed text. one of the given classes according to minimum-distance prin- ciple. 6. PROFILE BASED LEARNING 7. CLASSIFICATION EFFECTIVENESS The scheme of our algorithm is depicted in Figure 1 and represents a classical profile-based algorithm. For estimating the effectiveness [17] of a single class Ci classification we have used accuracy T Pi + T Ni Ai = T Pi + T Ni + F Pi + F Ni Values T Pi , T Ni , F Pi and F Ni are values from a confu- sion matrix (Table 1) and represent, respectively, the num- ber of yes-yes, no-no, yes-no and no-yes labeled instances. Table 1: A Confusion Matrix actual class class Ci yes no yes T Pi F Pi predicted class no F Ni T Ni For overall estimation of effectiveness we used macroaver- age of individual values: Pc i=1 Ai A= c where c denotes the total number of classes. The choice of a measure was strongly influenced by the current state-of-the-art results which are presented in re- spect to accuracy. In order to make our results comparable, we have chosen the same measure. 8. RESULTS We experimented with a set of newspapers articles5 writ- Figure 1: The algorithm scheme. ten independently by six authors. In order to achieve the au- thorship is the most important discriminatory feature among the authors, the selected articles meet a number of specific criteria. For the purpose of avoiding an author’s style change Step 1: The training data set consists of undisputed text over time, all articles per author are written in the same pe- samples of authors. All text samples per author are riod (within one year). To minimize the topic influence, we concatenated in one large text file and then the set of have only chosen articles that describe political situation in M most relevant n-grams or syllables is extracted to the country. All the texts (newspaper articles) are of the obtain the author profile. same genre, too. The number of articles per author and the Step 2: When a text of unknown authorship should be clas- total size of the training set is presented in Table 2. sified, the set of its M most relevant n-grams or sylla- The test set consists of non-overlapping articles and fol- bles is extracted. The values of the parameters M and lows the distribution of the training set. The number of n are the same as the values used in Step 1. articles per author and the total size of the test set is pre- sented in Table 3. Step 3: The profile of the text of unknown authorship is compared to the all authors’ profiles in respect to the 8.1 N-gram Based Profiles measures defined in the previous section. The tested values of the parameter n are in the inter- val from 1 to 10 and the tested values for the parameter Step 4: The obtained values are analysed by the system 5 and the smallest value is picked. http://www.danas.rs 111 11. REFERENCES Table 2: Authors in Training Set Author name number of train size [1] G. Frantzeskou, E. Stamatatos, S. Gritzalis, articles (in bytes) C. Chaski, and B. Howald. Identifying authorship by A1 Safeta Biševac 20 103,761 byte-level n-grams: The SCAP method. International A2 Zoran Panović 17 100,706 Journal of Digital Evidence, 6(1), 2007. A3 Aleksandar Roknić 27 101,809 [2] W. Fucks. On mathematical analysis of style. A4 Snežana Čongradin 28 102,756 Biometrika, (39):122–129, 1952. A5 Svetislav Basara 25 78,891 [3] W. Fucks. On nahordnung and fernordnung in samples A5 Miloš Vasić 18 102,875 of literary texts. Biometrika, (41):116–132, 1954. [4] W. Fucks and J. Lauter. Mathematische analyse des literarischen stils. Mathematik und Dichtung, pages Table 3: Authors in Test Set Author name number of test size 107–123, 1965. articles (in bytes) [5] J. Grieve. Quantitative authorship attribution: An A1 Safeta Biševac 10 82,945 evaluation of techniques. Literary and Linguistics A2 Zoran Panović 9 55,415 Computing, 22(3):251–270, 2007. A3 Aleksandar Roknić 13 50,193 [6] X. Guo, Y. Yin, C. Dong, G. Yang, and G. Zhou. On A4 Snežana Čongradin 14 56,558 the class imbalance problem. In Proc. of Fourth Int. A5 Svetislav Basara 12 64,684 Conf. on Natural Computation, pages 192–201, 2008. A5 Miloš Vasić 9 47,655 [7] D. Holmes. Authorship attribution. In Computer and the Humanities, volume 28, pages 87–106. 1994. [8] P. Juola. Authorship attribution. Foundation and M are 20, 100, 500, 1,000, 2,000, 3,000, 4,000 and 5,000. Trends in Information Retrieval, 1(3):233–334, 2006. The system achieves accuracy over 80% for all n-gram sizes greater then 3 and the profile sizes greater then 500. The [9] V. Keselj, F. Peng, N. Cercone, and C. Thomas. best achieved results are for the parameter n between 4 and 7 N-gram-based author profiles for authorship and for the profile size M between 1,000 and 4,000. The best attribution. In Proceedings of the Pacific Association achieved accuracy at all is 0.96 for n = 5 and M = 3, 000. for Computer Linguistics, pages 255–264, 2003. [10] M. Koppel and J. Schler. Exploiting stylistic 8.2 Syllable Based Profiles idiosyncrasies for authorship attribution. In The algorithm is tested for the parameter M with values Proceedings of IJCAI’03 Workshop on Computational from 100 to 1,200 by step 100. The values were limited by Approaches to Style Analysis and Synthesis, pages the maximal number of syllables per author. The results are 69–72, 2003. presented in Table 4 in respect to accuracy. [11] M. Koppel, J. Schler, S. Argamon, and E. Messeri. Authorship attribution with thousands of candidate authors. pages 659–660, 2006. Table 4: The Results for Syllable Based Profiles [12] E. Malone. A dissertation on parts one, two and three M accuracy M accuracy of Henry the Sixth tending to show that those playings 100 0.81 700 0.86 were not written originally by Shakespeare, 1787. 200 0.85 800 0.84 [13] T. Mitchell. Machine Learning. McGraw-Hill, 1997. 300 0.88 900 0.87 [14] A. Mohan, I. Baggili, and M. Rogers. Authorship 400 0.88 1,000 0.85 attribution of SMS messages using an n-grams 500 0.89 1,100 0.85 approach. Technical report, 2010. 600 0.89 1,200 0.86 [15] F. Mosteller and D. Wallace. Inference and disputed authorship: The Federalist. 1964. [16] M. Pešikan, J. Jerković, and M. Pižurica. Pravopis 9. CONCLUSIONS srpskoga jezika. Matica srpska, 2009. This paper presents some insights into an authorship at- [17] F. Sebastiani. Machine learning in automated text tribution problem for Serbian. The n-gram based approach categorization. ACM Computing Surveys, 34(1):1–47, proved its good performance and achieved accuracy from 2002. 80% up to 96% for the parameter 4 ≤ n ≤ 7, as well as [18] E. Stamatatos. Author identification using imbalanced the syllable based approach with accuracy between 81% and and limited training texts. In Proceedings of the 18th 89%. International Conference on Database and Expert In the future, both n-gram based and syllable approaches, Systems Applications, pages 237–241, 2007. combined with the wider set of measures, should be tested [19] E. Stamatatos. A survey of modern authorship on expanded corpora and longer list of authors. We also attribution methods. Journal of the American Society plan to improve a syllabication phase since the results of for Information Science and Technology, syllable based approach are promising. 60(3):538–556, 2009. 10. ACKNOWLEDGMENTS This research was supported by the Serbian Ministry of Education and Science under the grant 178006 (Serbian Lan- guage and its Resources). 112